SUMMARY


Computers  that  understand  speech  are  expected  to  facilitate  natural  man- 
machine  interaction,  but  the  problems  involved  demand  the  attention  of  several 
disciplines  including  linguistics,  computer  systems  design,  perception  theory, 
speech research, and engineering. Linguistic and perceptual arguments, in
particular, suggest that devices which recognize speech will have to make use of
grammatical  structure  ("syntax")  in  early  stages  of  the  recognition  procedures. 
This  can  be  accomplished,  in  part,  by  using  certain  acoustic  features,  called 
the  prosodic  features,  to  segment  the  speech  into  grammatical  phrases,  and  to 
identify  those  syllables  that  are  given  prominence,  or  stress,  in  the  sentence 
structure. 

Prosodic  features  (which  include  the  durations  of  vowels  and  consonants, 
and  time-varying  measures  of  the  rate  of  vibration  of  the  talker's  vocal  cords 
and the energy in the speech) also provide some cues to the grammatical
categories of phrases, the semantic associations between phrases, and the positions of
reliable  data  for  determining  the  sound  sequence  and  word  content  of  the  speech. 

Research  described  in  this  report  is  concerned  with  developing  methods  for 
detecting  stressed  syllables  and  the  boundaries  between  grammatical  phrases, 
using prosodic features. A speech recognition strategy is outlined which begins
with detecting boundaries between phrases, by finding positions where the
acoustic data show a substantial decrease in the rate of vocal cord vibration (that
is,  the  "voice  fundamental  frequency"),  followed  by  an  increase  in  fundamental 
frequency.  Once  the  connected  speech  is  thus  segmented  into  phrases,  the  strategy 
calls  for  locating  the  stressed  syllables  in  each  phrase.  Then,  analysis  of 
reliable  distinguishing  features  of  the  vowels  and  consonants  within  the  stressed 
syllables  is  attempted.  Speech  sounds  are  expected  to  be  more  clearly  articulated 
and  easier  to  distinguish  in  stressed  syllables,  than  in  unstressed  or  slurred 
syllables,  where  articulation  (and  consequent  acoustic  data)  is  not  as  precise 
or  consistent  from  talker  to  talker  or  time  to  time. 
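
The boundary-detection step of this strategy can be illustrated with a minimal sketch in Python. It is not the program described in this report: the function name, the sample contour, and the 10 percent thresholds standing in for a "substantial" fall and rise are all assumptions made for illustration. The contour is assumed to be a list of fundamental frequency values in Hz, one per voiced analysis frame.

```python
def find_phrase_boundaries(f0, min_fall=0.10, min_rise=0.10):
    """Return frame indices of F0 valleys that follow a substantial fall
    and are followed by a substantial rise.

    min_fall and min_rise are fractional changes (hypothetical defaults);
    the report itself only says the changes must be "substantial".
    """
    boundaries = []
    if not f0:
        return boundaries
    peak = valley = f0[0]      # highest value before the fall, lowest since
    valley_idx = 0
    for i in range(1, len(f0)):
        value = f0[i]
        if value < valley:                      # contour is still falling
            valley, valley_idx = value, i
        fell = (peak - valley) >= min_fall * peak
        rose = (value - valley) >= min_rise * valley
        if fell and rose:
            # Fall followed by rise: mark the valley as a candidate boundary
            boundaries.append(valley_idx)
            peak = valley = value               # start looking for the next one
            valley_idx = i
        elif not fell and value > peak:
            # No substantial fall yet and F0 is climbing: move the peak up
            peak = valley = value
            valley_idx = i
    return boundaries


# Hypothetical contour with two fall-rise valleys (values in Hz)
contour = [120, 130, 125, 110, 100, 105, 115, 125, 120, 105, 95, 100, 112]
print(find_phrase_boundaries(contour))
```

Using fractional rather than absolute thresholds is one way to make the test less dependent on the talker's overall pitch range; whether that choice suits real contours is untested here.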

Not all of the facilities for carrying out this general strategy have yet been
implemented. A computer program for detecting boundaries between phrases succeeds
for about 80 to 90% of the expected boundaries, but also gives some "false"
boundaries that are not apparently associated with syntactic structure.
Improvements of this program are being planned, using energy contours and some
detailed  refinements  in  the  test  for  substantial  changes  in  voice  fundamental 
frequency. 

Facilities  have  been  implemented  for  analyzing  the  acoustic  speech  signal 
to  determine  the  resonances  of  the  vocal  tract  (formants)  that  may  indicate  the 
vowels or consonants intended by the talker. These formant-monitoring facilities
use frequency spectra derived from a recent analysis technique called linear
prediction,  which  extracts  the  transfer  function  of  the  talker's  vocal  tract, 
de-emphasizing  the  harmonic  structure  of  the  vocal  cord  excitation  source. 

Plots  of  the  frequency  of  formants  versus  time  are  obtained  from  the  smoothed 
frequency  spectra  provided  by  linear  predictor  analysis. 
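
As an illustration of how a smoothed spectrum and a formant estimate can be derived by linear prediction, the following Python sketch may help. It assumes the autocorrelation method solved by the Levinson-Durbin recursion, which the report does not specify; the function names, the 8000 Hz sampling rate, and the synthetic one-resonance test signal are likewise assumptions, not the facilities described here.

```python
import math


def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor polynomial A(z) = 1 + a1 z^-1 + ... + ap z^-p.

    Returns the coefficient list [1, a1, ..., ap] and the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def envelope_db(a, gain, freq_hz, fs):
    """Smoothed (all-pole) spectrum gain / |A(e^jw)|^2 in dB at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs
    re = sum(c * math.cos(w * k) for k, c in enumerate(a))
    im = -sum(c * math.sin(w * k) for k, c in enumerate(a))
    return 10.0 * math.log10(gain / (re * re + im * im))


def resonator_response(freq_hz, radius, fs, n):
    """Impulse response of a single two-pole resonance (a stand-in 'formant')."""
    a1 = -2.0 * radius * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = radius * radius
    h = []
    for i in range(n):
        s = 1.0 if i == 0 else 0.0
        if i >= 1:
            s -= a1 * h[i - 1]
        if i >= 2:
            s -= a2 * h[i - 2]
        h.append(s)
    return h, [1.0, a1, a2]


FS = 8000                                   # assumed sampling rate, Hz
signal, true_a = resonator_response(500.0, 0.95, FS, 400)
a, err = levinson_durbin(autocorrelation(signal, 2), 2)

# Pick the formant as the peak of the smoothed spectrum on a 5 Hz grid
peak_hz = max(range(0, FS // 2 + 1, 5),
              key=lambda f: envelope_db(a, err, f, FS))
print(a, peak_hz)
```

Because the predictor models only the resonances, the envelope contains no harmonic ripple from the excitation, which is why peak picking on it is simpler than on a raw spectrum. Real speech would need windowing, a higher predictor order, and frame-by-frame tracking, none of which is shown.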

A  program  has  been  implemented  for  obtaining  voice  fundamental  frequency  and 
two  measures  of  energy  in  the  speech  wave.  These  may  be  plotted  versus  time, 
or  displayed  on  an  interactive  cathode  ray  tube  terminal  of  the  computer  system. 

A  procedure  for  locating  stressed  syllables  would  represent  a  major  component 
in  the  strategy  for  analyzing  distinguishing  features  of  speech  sounds  in  stressed 
syllables  within  a  constituent.  Such  a  procedure  for  locating  stressed  syllables 
cannot  be  devised,  however,  until  more  is  known  about  the  acoustic  features  that 
mark the presence of stressed syllables in connected speech.

Experiments  are  being  performed  to  study  stress  patterns  in  a  portion  of  a 
short  text  called  the  "Rainbow  Passage."  This  text  has  been  used  extensively 
in  studies  of  prosodic  patterns  in  speech,  and  has  the  advantage  of  being  a 
well-known  semantically-connected  text  of  declarative  sentences,  with  a  variety 
of  grammatical  phrase  structures.  The  experiments  with  the  Rainbow  Script  are 
designed  to  interrelate  theoretical  linguistic  predictions,  actual  perceptual 
judgments, and acoustic data about stressed, unstressed and reduced (i.e.,
incompletely articulated) syllables in the connected speech. A grammatical
analysis and use of published English stress rules will be done to provide
theoretical predictions about stress patterns. An acoustic analysis has been performed
on the speech of six talkers reading the Rainbow Script, yielding contours of
fundamental frequency, energy, spectral data, and formant values versus time. These
acoustic data must yet be analyzed, to determine which acoustic features
correlate well with predicted stress patterns and with perceptual judgments of stress.

When  listeners  heard  clauses  or  sentences  in  the  Rainbow  Script  repeated  at 
will (by rewinding and replaying a tape), they were able to distinguish individual
stressed, unstressed, and reduced syllables. Results differed little from talker
to talker, or from listener to listener. Most differences that did occur
indicated a difficulty in clearly distinguishing between unstressed and reduced
syllables.  One  representative  listener  repeated  the  test  several  times,  and 
demonstrated  general  repeatability  of  results.  In  one  repetition,  a  computer  was 
used  in  digitizing,  storage,  and  replay,  in  place  of  the  usual  tape-rewind  method. 
Under  these  conditions,  the  listener  believed  he  could  detect  two  levels  of 
stressed syllables ("highly stressed" and "lesser stressed"), besides unstressed
and  reduced  levels. 

About  half  the  syllables  were  judged  as  stressed,  while  somewhat  fewer  were 
judged  as  reduced,  and  fewer  than  one  quarter  were  judged  as  unstressed. 

Further  perception  tests  are  yet  to  be  made,  with  other  listeners  repeating 
the  test,  marking  all  syllables,  and  with  repetitions  to  test  consistency. 

Studies  of  the  relationships  among  the  acoustic,  perceptual,  and  linguistic 
data  must  be  performed.  Further  tests  are  also  being  planned,  using  speech  texts 
which  are  now  being  designed  to  isolate  and  study  individual  effects  of  position 
in  the  sentence,  grammatical  phrase  structure,  semantic  structure,  and  phonetic 
content.  Univac  will  also  be  evaluating  speech  data  recorded  by  contractors  who 
are  building  speech  understanding  systems  for  ARPA.  This  is  one  of  many  scheduled 
activities  designed  to  integrate  prosodic  information  into  other  programs  on 
total  speech  understanding  systems. 

1. INTRODUCTION

This  is  a  report  on  work  currently  in  progress  in  the  Univac  Speech 
Communications  Group,  under  contract  with  the  Advanced  Research  Projects 
Agency (ARPA). As a part of ARPA's total program in research on speech
understanding systems, the research reported herein is concerned with extracting
reliable prosodic and distinctive features information from the acoustic
waveform of connected speech (sentences and discourses). Studies are being
concentrated on problems of detecting stressed syllables and syntactic boundaries.

The  traditional  model  of  speech  recognition  has  assumed  that,  by  tracking 
the  right  "information-carrying"  parameters,  and  using  any  of  several  phonemic- 
segment classification techniques, one could determine phonemic strings
corresponding to those intended by the talker. Then, the phonemic strings may be
applied  to  higher-level  linguistic  analyses  to  determine  words,  phrases,  and 
utterance  meanings. 

At Univac, work on automatic speech recognition (ASR) has progressed along
a different approach. The viewpoint is that versatile speech recognition will
proceed by making use of reliable information in the acoustic data in combination
with early use of linguistic regularities. As will be outlined in this report,
recognition is to be accomplished by using prosodically-detected stress patterns
and syntactic structure in aiding a partial distinctive-feature-estimation
procedure. Prosodically-detected syntactic structure will also be used to aid
syntactic parsers and semantic processors.

Prosodic  cues  to  sentence  structure,  and  prosodic  aids  to  the  location  of 
reliable  acoustic  phonetic  information,  have  been  given  little  or  no  attention  in 
previous speech recognition efforts. The strong motivations for the use of
prosodic patterns in speech recognition procedures will thus be presented in some
detail in section 2. Versatile facilities for extracting prosodic features,
spectral data, and formants, and a program for detecting boundaries between
syntactic phrases (constituents), will be described in section 3. Initial
experiments to be described in section 4 are being conducted to: determine the
acoustic correlates of stress and constituent boundaries; determine listeners'
abilities to perceive stressed, unstressed, and reduced syllables; and derive
theoretical linguistic predictions about stress patterns and syntactic boundaries
in English sentences. These experiments are being performed on a well-known
connected  text  called  the  "Rainbow  Script"  (Fairbanks,  1940),  but  further  studies 
will  be  conducted  later  on  texts  specifically  designed  to  isolate  interfering 
factors  in  prosodic-phonetic-syntactic  interaction  in  connected  speech.  In 
section  5,  efforts  to  design  good  speech  texts  are  described,  along  with  efforts 
to  integrate  prosodic  studies  with  research  on  other  aspects  of  total  speech 
understanding  systems. 

2. MOTIVATION FOR PROSODIC AIDS TO SPEECH RECOGNITION

Speech is man’s most natural, universal, and familiar form of communication.
It has many advantages which have been shown to apply to man-machine
interaction as well as to human communication (Lea, 1968; 1970). Among the most
difficult problems involved in speech communication between man and computer is the
computer recognition of speech. Speech recognition might be defined as the
process of transforming the continuous acoustic speech signal into discrete
representations which may be assigned proper meanings, and which, when comprehended in
a  total  speech  understanding  system,  may  be  used  to  affect  responsive  behavior. 

2.1 Arguments Favoring Prosodic Cues to Syntactic Structure

Early  work  on  speech  recognition  was  concerned  with  pattern  matching  on 
isolated  words,  achieved  by  direct  comparison  of  input  spectral  data  with  stored 
spectral  patterns  (or  "templates")  obtained  from  previous  processing  of  the  words 
in  the  vocabulary.  Later  work  acknowledged  the  phoneme  as  a  recognizable  segment. 
Word  recognition  was  to  be  done  by  recognizing  phoneme  strings  as  constituting 
words.  Probabilities  of  phoneme  sequences  were  expected  to  help  increase  the 
accuracy  of  word  recognition  algorithms. 

As interest in recognition of continuous speech developed, the general
problem of speech recognition was regarded as being composed of two parts: "a
primary recognition based solely on the sound shapes of the acoustic signal, a
secondary recognition of the linguistic (grammatical and syntactic) content based
on the (presumably phonemic) output of the primary recognition level" (Lindgren,
1965). The prevalent hope was that one could segment the acoustic stream into
moderate-sized discrete atoms (phonemes, diphones, or such), which could be
independently recognized.

2.1.1  Linguistic  Arguments 

There are two faulty assumptions implicit in such a hope. One, often called
the linearity condition, asserts that there should be a distinguishable segment
in the speech wave for each abstract (phonemic) segment, and if abstract segment
A precedes abstract segment B in the abstract linguistic string, then the time
segment associated with A must precede that for B. The other assumption, called
the invariance condition, asserts that all the distinguishing features of an
abstract segment must be present (within the segment’s time-stretch) for each
occurrence of that segment, while the set of all such feature values should
not occur for other abstract segments. As Chomsky and Miller (1963, p. 311)
noted,  "If  both  the  invariance  and  linearity  conditions  were  met,  the  task  of 
building  machines  capable  of  recognizing  the  various  phonemes  in  normal  human 
speech  would  be  greatly  simplified".  However,  violations  of  invariance  and 
linearity abound. For example, the distinction between the words ladder and
latter (phonemically, /lædər/ vs /lætər/), which is phonemically in the third
segment, physically occurs often in the lengthening of the second phonetic
stretch of sound (phonetically, [læ:Dər] vs [læDər]). This violates both the
linearity and invariance conditions. Any coarticulation process, or context
dependency, whereby a (nearby or distant) phoneme causes changes in the
distinguishing acoustic properties of a given phoneme, would similarly violate the two
conditions. 

Another example of violations of linearity and invariance concerns an
additional way in which voicing of consonants is marked in the acoustic data.
Voicing of some voiced consonants is not always evidenced by the expected continuous
periodic  vibration  of  the  vocal  cords  throughout  the  consonant.  Both  voiced  and 
unvoiced  consonants  may  have  initial  "voiced"  portions  of  their  closure  period 
during  which  the  vocal  cords  vibrate  periodically,  followed  by  "unvoiced"  portions 
during which periodic vibration does not occur. When such discontinuities in
vocal cord vibration occur, voicing is not determined by features within the closure
period  associated  with  the  consonant.  Secondary  features  outside  the  time  stretch 
of  the  consonantal  closure,  such  as  the  initial  rate  of  change  of  the  fundamental 
frequency  of  vocal  cord  vibration  within  the  following  vowel,  must  be  used  to 
establish  phonemic  voicing  (cf.  Stevens,  1971;  Lea,  1972a,  Ch.  4).  As  noted  by 
the  example  above,  voiced  consonants  also  cause  a  lengthening  of  the  duration 
of  the  preceding  vowel.  Consequently,  a  speaker  has  the  option  of  producing 
voiced stops and fricatives without actually vibrating his vocal cords
continuously, provided he increases the duration of the preceding vowel, or otherwise
supplies cues somewhere within the utterance as to the [+voiced] state of the
consonant. 

Stress and intonation are other linguistic factors that show marked physical
effects in one segment (vowel or syllable) due to surrounding segments.
Violations of linearity and invariance occur so frequently that the linguists have
written  quite  general  phonological  rules  (such  as  the  one  for  lengthening  of 
vowels  before  voiced  consonants,  or  the  context-dependent  stress  rules)  to  capture 
such  generalizations.  Because  of  the  structural  redundancy  provided  by  the 
listener's linguistic knowledge, a speaker does not have to encode into the
acoustic waveform all of the features describing an utterance, and the features that
he does choose to encode can vary from one repetition of a given utterance to
the next. A listener will be able to fill out the distinctive features matrix
for a word he has heard, knowing only some of the matrix elements and using his
knowledge  of  the  structure  of  the  language.  For  example,  in  the  feature  matrix 
representation  of  the  single  morpheme  word  "slump"  shown  in  Figure  1,  a  total  of 
39  matrix  elements  is  used  to  specify  the  five  phonemes.  However,  if  full  use 
is  made  of  the  structure  of  English,  the  24  unshaded  matrix  elements  can  be 
derived  from  knowledge  of  only  the  15  that  are  shaded  in.  To  do  so  in  this 
example,  one  would  utilize  the  facts  that  /s/  is  the  only  sound  that  can  precede 
an  initial  /l/,  and  that  if  a  single  morpheme  word  has  a  final  consonant  cluster 
beginning  with  a  nasal,  the  following  consonant  must  share  place  of  articulation. 
Of  course,  the  number  of  features  necessary  for  identification  could  be  less  than 
15  if  one  was  dealing  with  a  restricted  and  limited  lexicon,  and  considerably  less 
if  the  word  occurred  in  a  sentence  or  phrase  that  further  limited  the  number  of 
choices  available  to  a  listener. 
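
The way structural knowledge lets a listener fill in unspecified matrix elements can be sketched in Python. The sketch covers only the second fact cited above (a consonant following a nasal in a word-final cluster shares its place of articulation). The three-feature segment records and the function name are hypothetical simplifications, not the feature set of Figure 1.

```python
def fill_place_after_nasal(segments):
    """Return a copy of `segments` with place of articulation filled in for a
    word-final nasal + consonant cluster, and the number of values derived.

    Each segment is a dict with the (hypothetical) keys 'symbol',
    'consonant', 'nasal' and 'place'; place is None when unspecified.
    """
    filled = [dict(seg) for seg in segments]     # leave the input untouched
    derived = 0
    if len(filled) >= 2:
        nasal, last = filled[-2], filled[-1]
        if nasal["nasal"] and last["consonant"]:
            if last["place"] is None and nasal["place"] is not None:
                last["place"] = nasal["place"]   # copy place from the nasal
                derived += 1
            elif nasal["place"] is None and last["place"] is not None:
                nasal["place"] = last["place"]   # or from the final consonant
                derived += 1
    return filled, derived


# "slump" with the place of the final /p/ left unspecified
slump = [
    {"symbol": "s", "consonant": True, "nasal": False, "place": "alveolar"},
    {"symbol": "l", "consonant": True, "nasal": False, "place": "alveolar"},
    {"symbol": "uh", "consonant": False, "nasal": False, "place": None},
    {"symbol": "m", "consonant": True, "nasal": True, "place": "labial"},
    {"symbol": "p", "consonant": True, "nasal": False, "place": None},
]

filled, derived = fill_place_after_nasal(slump)
print(filled[-1]["place"], derived)
```

A fuller treatment would apply many such rules in sequence until no further matrix elements can be derived; a single rule is enough to show the principle.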

The  combination  of  linguistic  and  lexical  structure,  and  multiple  cues  for 
some  features,  allows  a  speaker  to  thus  be  imprecise  and  inconsistent  in  his 
production  and  still  be  clearly  understood.  The  net  result  is  that  in  addition 
to  the  fact  that  the  encoding  of  phonemic  and  prosodic  information  into  the 
acoustic waveform is a complex one involving overlapping in time and
environmental dependence, the encoding itself is often performed incompletely and with
considerable variability. Indeed, in some utterances, whole phonemes or syllables
may  be  "missing"  from  the  pronunciation.  A  speech  recognition  system  based  on 
acoustic  manifestation  of  all  phonemes  or  all  distinctive  features  would  thus 
frequently  fail. 

Figure 1. Distinctive features matrix for "slump." Features that play
no role in describing a particular phoneme are left blank. The
significance of the shading is explained in section 2.1.1.

These and other arguments (cf. Chomsky and Miller, 1963; Lea, 1972a; IN
PRESS; Medress, 1972) against invariance and linearity in small speech units
dispel any notions that recognition based on simple concatenation of categorized
time segments can be completely successful. Phonemic context has thus been
recognized as necessary to fill in acoustically "unspecified" distinctive
features in the representation of received utterances, and to even fill gaps for
"missing" phonemes in some pronunciations of connected speech.

However, linguists also argue (cf. reviews by Lea, 1972a, b) that phonemic
recognition, or distinctive features estimation, cannot even be accomplished with
the use of phonemic context and known phonetic redundancies. They argue that
"in general, the perceiver of speech should utilize syntactic cues in determining
the phonemic representation of an utterance" (Chomsky and Miller, 1963, p. 314;
emphasis added). Chomsky and Halle (1968, p. 31), for example, have developed
a detailed set of phonological rules for (phonetic) stress assignment, along
with vowel reduction rules and other phonological rules, which depend explicitly
on the word categories and phrase structure of utterances. Such rules are assumed
to be used by the speech perception system in relating phonetic data to linguistic
structure.

2.1.2 Perceptual Arguments

Psychologist George Miller (1962) also has sharply criticized the view that
speech  recognition  should  be  achieved  by  first  deciding  what  phonetic  segments 
have  occurred,  then  determining  what  phonemes  and  morphemes  were  involved  based 
on  the  lower-level  phonetic  decisions,  and  so  on  up  to  larger  units  and  higher 
linguistic  levels.  He  gives  several  reasons  for  doubting  that  people  naturally 
operate that way (see also Chomsky, 1964, pp. 106-114, and Flanagan, 1965,
pp. 236-8, for other arguments). If we are to assume that a recognizer will
exhibit behavior similar to that of the human perceiver, we may also doubt the
value of such a model for artificial recognizers. Miller argues (1962, p. 81)
as  follows: 

Phenomenologically,  it  seems  that  the  larger,  more  meaningful 
decisions  are  made  first,  and  we  pursue  the  details  only  so  far  as 
they are necessary to serve our immediate purposes... If the
small details of input are discriminated first, how is it possible
to take advantage of the redundancy of the message?... we must
regard the decisions reached at the lower levels as tentative and
subject to revision pending the outcome of decisions made at some
higher, more molar level. Once this tentative character is
admitted, of course, it becomes necessary to continue storing the
original  input  until  the  molar  decisions  have  been  reached. 

However,  if  complete  storage  is  necessary  even  after  the  lower- 
level  decisions  have  been  tentatively  reached  why  bother  to  make 
the  lower  decisions  first? 

Arguing  from  reaction  time  studies,  Miller  asserted  that  we  just  do  not 
have  enough  time  to  make  all  phonemic  decisions  at  the  rate  at  which  phonemes 
occur in speech. He estimated that about one categorical decision per second
might  occur  in  ordinary  listening,  and  concluded  that:  "If  we  accept  this  as 
a  rough  estimate,  it  suggests  that  the  phrase  —  usually  about  two  or  three 
words  at  a  time  —  is  probably  the  natural  decision  unit  for  speech"  (Miller, 
1962,  p.  81).  Miller  reported  some  experimental  results  that  supported  his 
conjectures. 

Other  perception  studies  have  confirmed  the  use  of  phrases  as  units,  at 
whose boundaries decisions appear to be made. Johnson (1965) showed that the
probability of an error in remembering the next word in a sentence increased
significantly at phrase-structure boundaries, thus indicating that sentences
are remembered phrase-by-phrase. Similar studies had previously been made
showing that probabilities of error in predicting phonemes increased markedly
at word and phrase boundaries. Several other studies showed that clicks
superimposed on speech were perceived as occurring near certain major deep-structure
syntactic boundaries within the sentence, regardless of actual timing of the
clicks within the speech continuum (cf. review by Gleitman and Gleitman, 1970).

It has been suggested that the perceiver waits until the end of such phrase
units ("constituents") before making decisions as to the sound structure content
of the large unit. The timing of a click is lost since decisions as to its
phonetic significance are delayed until the end of the constituent, at which time its
relationship to the rest of the sound sequence is found to be nil, and its
relationship to the time structure of the recognized large unit cannot be
established. Fodor and Garrett (1966), in particular, noted that constituents
dominated by a sentence node (in deep structure) yielded particularly regular
click displacement to their boundaries. The title of a recent paper by Bever,
Lackner,  and  Kirk  (1969),  nicely  summarizes  the  usual  interpretation  of  these 
studies:  "The  Underlying  Structures  of  Sentences  Are  the  Primary  Units  of 
Immediate  Speech  Processing". 

Such  results  suggest  that  people  generally  make  slow,  infrequent  decisions 
about  relatively  large  units  of  speech,  rather  than  many  fast  decisions  about 
small  units.  The  results  do  not,  of  course,  imply  that  phrases  are  the  only 
units  involved  in  perception  (cf.  Haggard,  1967). 

Miller  (1962)  observed  that  the  use  of  large  units  in  the  detection  of  the 
message  in  speech  is  not  surprising  in  the  light  of  studies  in  coding  theory. 
Messages  can  be  more  efficiently  encoded,  and  error-correcting  information  can 
be  introduced,  if  a  long  string,  or  long  segment,  is  stored  and  encoded  (and 
later decoded) as a unit.

It is also interesting that children, as primitive speech-recognizing
systems, learn intonational cues to phrase structure and sentence type before
acquiring any competence with the specific phonemics of their language community
(Lieberman, 1967a; Lewis, 1936; Leopold, 1953). In another area, Grimes (1969)
has shown that field linguists trying to perceive the structure in a new language
benefit  from  early  use  of  large-unit  segmentation  into  "breath  groups"  and 
rhythmic  (sense  group  or  phrase)  units. 

If  even  phonetic  and  phonemic  decisions  involve  syntactic  information,  as 
suggested in section 2.1.1 above, then Miller would seem to be right in
suggesting that advantageous use of the redundancy of language and speedy perception
require  making  higher-level  decisions  about  large  decision  units  (such  as 
phrases)  before  firm  decisions  are  made  about  lower-level  phonemic  units.  One 
might  conclude  that  there  is  overwhelming  linguistic  and  perceptual  evidence 
suggesting the need for early introduction of syntactic hypotheses in
recognition schemes.

2.1.3 Prosodic Cues to the Presence of Large Linguistic Units

These  arguments  suggest  a  somewhat  novel  theory  of  speech  recognition,  using 
syntax  in  phonemic  decisions.  Speech  perception  then  involves  making  use  of 
certain  expectations  and  received  cues  to  determine  the  syntactic  structure  (and 
semantic content) of an utterance. Given a hypothesis as to the surface
syntactic structure, the perceiver uses phonological principles to determine a
phonetic  shape.  The  hypothesis  will  be  accepted  if  its  associated  acoustic 
phonetic  shape  isn't  too  radically  different  from  the  acoustic  input  (Chomsky 
and  Halle,  1968). 


How  might  one  make  the  preliminary  syntactic  hypotheses  called  for  in  the 
early stage of recognition schemes, without depending upon a preliminary
segmental (phonemic) analysis? The listener must presumably be using some cues
in  the  acoustic  signal  to  guide  his  hypothesis-making.  What  acoustic  cues  or 
features  might  be  used?  Obviously  they  must  be  features  which  extend  throughout 
the  large  units  of  syntactic  structure,  or  they  must  be  localized  features  that 
mark  unit  boundaries,  centers  of  units,  or  some  such  critical  points  in  the 
structure. The boundary-marking features, often identified as "junctures"
(Peterson, 1963; Delattre, 1965), disjuncture (Lieberman, 1967), or delimitative
and culminative elements (Trubetskoy, 1939, p. 27), signal the boundaries
between  two  units  and  indicate  how  many  'units'  are  contained  in  a  particular 
sentence  or  other  extended  utterance.  Other  features  extending  over  a  large 
unit  may  provide  a  distinctive  function  which  identifies  the  class  of  a  unit 
and  distinguishes  that  unit's  category  from  other  possible  structural  categories. 


Among the features that are known to provide delimitative and distinctive
markings of syntactic units are the prosodic features. (Other features,
such as the allophonic variations between word-initial aspirated stops and
word-final unaspirated stops, would also be relevant.)

Prosodic  features  that  have  long  been  recognized  as  indicators  of  English 
constituent structure are voice fundamental frequency (abbreviated as F0),
speech  intensity,  and  the  relative  durations  of  phonetic  segments.  For  example, 
vowel  and  consonant  durations  are  known  to  increase  just  before  pauses  between 
syntactic units (Allen, 1968; Barnwell, 1971; Mattingly, 1966). Lieberman
(1967,  pp.  152-3)  showed  that,  for  some  lower-level  syntactic  disjunctives,  such 
as  the  distinction  between  "light  housekeeper"  and  "lighthouse  keeper",  the 
time  interval  between  vowel  centers  is  a  reliable  cue  for  disjunctive  positions. 
Prosodic  features  also  closely  relate  to  stress  patterns,  which  in  turn  are 
closely  associated  with  the  syntactic  bracketing  and  syntactic  categories  in 
sentences (Chomsky and Halle, 1968; Lea, 1972a, Ch. 6).


2.2 Prosodic Cues to Boundaries Between Phrases

For  decades,  linguists  have  claimed  that  intonation  (that  is,  the  perceived 
variations in the pitch of the talker) indicates the immediate constituent
structure (i.e., "surface structure") of English sentences (Jones, 1909; 1932; Pike,
1945; Hultzen, 1957; 1959; Wells, 1947). Trager and Smith, whose pitch and
stress "levels" (1951) are widely used, claimed that monitoring voice fundamental
frequency (F0) makes it possible to have "solidly established objective procedures"
for "the recognition of immediate constituents and parts of speech syntax" (1951,
p. 77). Yet, they did not say exactly how to use intonation for structural
analysis, and, until recently, no such "objective procedures" had been publicized.
Gleason (1961, p. 169) also considered intonation and stress as "the dominant
elements in the syntax-signaling system". Study of metrical patterns in English
verse also indicates strong markings of syntactic boundaries by the prosodic
features (Keyser, 1969).

Transformational  linguists  have  also  recognized  this  syntax-signaling  role 
of  intonation.  Lieberman  (1967,  p.  314)  asserted  that: 


"Intonation  has  a  central  role  in  the  transformational  recognition 
routines that the listener must use for syntactic analysis. Intonation
provides acoustic cues that segment the speech signal into
linguistic units suitable for syntactic analysis."


Stockwell (1960) noted that, "There is a good deal of evidence ... that
intonation patterns are the absolutely minimal differentiators of numerous
utterance tokens." That is, intonation helps disambiguate structurally ambiguous
utterances by indicating their bracketing into syntactic units. Bierwisch
(1965) demonstrated that it is possible to generate an intonation contour (for
a German sentence), if only the surface syntactic tree and related syntactic
information are provided.

While there has been widespread agreement about intonation marking some
boundaries between sentence parts, there has been considerable dispute about
how this is accomplished (see review by Lea, 1972a, section 1.4). Some issues
are the following: (1) What features of the voice fundamental frequency contours
mark boundaries between subunits of sentences? (2) Which sentence portions
(major grammatical constituents, clauses, all syntactically bracketed units, or
arbitrary sequences of words) are actually demarcated by intonational features?
(3) Are syntactic units demarcated by intonation patterns in all utterances, or
only when the talker is explicitly trying to clarify the structure of structurally
ambiguous utterances?

One of the weakest hypotheses about intonational cues to sentence structure
(Armstrong and Ward, 1926) is that sentences may (but need not necessarily) be
divisible into parts by intonation contours associated with any arbitrary (but
fairly long) sequences of syllables or words. The units need not be syntactic
constituents, and indeed the individual talker may divide (or not divide) a
sentence differently from time to time, and different talkers may divide
utterances differently. At the other extreme, all sentences are assumed to be
divisible into syntactic units by intonational (or other prosodic) cues that always
occur at unit boundaries (Trager and Smith, 1951; Wells, 1947).

Invariance applied to such prosodic aspects of language would imply that
a syntactic boundary always has an associated acoustic (or phonetic) boundary
marker manifested, and only when the syntactic boundary occurs will that acoustic
marker appear (Trager and Smith, 1951, p. 51). Linearity would imply that a
boundary between two syntactic units would be manifested by acoustic features at
the time stretch ('pause' or such) after the time stretch associated with the
last phoneme of the earlier constituent and before the time stretch associated
with the phonemes of the later constituent.

Malmberg (1963, p. 69) implicitly rejected the linearity condition for
structural boundaries. He broke up utterances into "measures" on the basis of
perceived intonation, yielding such divisions as the following:

The boys are - playing in the - street.

fc*  bolz  - pi  Ell  5  in  -  strit 

Each  measure  or  group  has  an  "accented"  (stressed)  syllable  and  zero  or  more 
unstressed  syllables.  The  breaks  he  shows  occur  just  before  the  stressed  sylla¬ 
bles,  and  not  necessarily  at  the  points  in  the  phonemic  string  where  structural 
boundaries  occur.  This  break-down  into  groups  is  exactly  what  is  obtained  from 
the automatic analysis of fundamental frequency contours, to be described in
Section 3.4. That is, strict linearity must be rejected if one is to succeed in
finding acoustic cues to the syntactic breaks.

Recently, Robert Scholes (1971, pp. 50-73) investigated whether fundamental
frequency, peak amplitude in syllabic nuclei, or inter-vowel intervals provided
the best cues as to whether a syntactic (subject-predicate) boundary occurred.
Listeners were presented with three contiguous words extracted from one of eight
sentences (read by any one of ten talkers) and were asked to judge whether a
subject-predicate boundary occurred between the first and second, or between the
second and third, words. About 84% of all subject-predicate boundaries were perceived
in the correct positions by the panel of listeners. For these perceived boundaries,
he found that the time interval between vowels did not usually correlate well with
whether or not a subject-predicate boundary was between the two syllables. However,
the syllable that is perceived as phrase-final was more intense (higher in VU
reading) than the preceding or following (non-phrase-final) syllables. No strong
generalizations could be made from Scholes' study of fundamental frequency (F0)
contours, primarily because he only investigated whether F0 increased or decreased
within a syllable, not how F0 values in one syllable related to those in other
syllables. Other studies (Lea, 1971, 1972a, 1972b) show good correspondence
between F0 contours and boundaries between syntactic units.

As pointed out in a recent study (Lea, 1972a, section 1.4), most attempts
to find acoustic cues for syntactic boundaries have involved some of the most
questionable syntactic boundaries possible, including the subject-predicate
boundary of a sentence, and such small-unit distinctions as light housekeeper versus
lighthouse keeper. Studies to be reported on in sections 3.4 and 4 are designed
to test for prosodic cues to constituent boundaries in a variety of positions in
syntactic structure.

2.3 Prosodic Cues to Stress Patterns and Categories

Besides  allowing  segmentation  of  sentences  into  syntactic  units,  prosodic 
features  can  also  provide  some  cues  to  the  categories  of  syntactic  units  (such 
as sentences, nuclear noun phrases, compound nouns, etc.). This would be
accomplished primarily through using prosodic features to determine stress patterns,
which  are  known  to  associate  closely  with  syntactic  bracketing  and  syntactic 
categories. 

Since fundamental frequency (F0) and intensity both tend to be higher, and
phonetic durations tend to be longer, for stressed than for unstressed syllables,
monitoring F0 contours and intensity contours might determine the relative stress
levels throughout the utterance (Lea, 1972a; IN PRESS, p. 200; Hughes, Li,
and Snow, 1972; Medress, Skinner, and Anderson, 1971). The following conjecture
is then suggested:

"If  one  could,  by  tracking  acoustic  features  such  as  voice  fundamental 
frequency  (pitch),  average  speech  power  (intensity),  and  phonetic  dur¬ 
ations,  determine  the  stress  pattern(s)  of  an  utterance,  and  if  such 
stress  patterns  could  predict  vital  aspects  of  surface  syntactic  structure, 
then  one  might  be  able  to  use  such  prosodic  information  to  automati¬ 
cally  guess  at  surface  syntactic  structures."  (Lea,  IN  PRESS,  p.  200) 

To test such an idea, one must have: (1) an adequate procedure for
determining stress patterns from acoustic features; and (2) an adequate set of
rules for predicting syntactic structure given that the stress pattern can be
established.


Various schemes for determining stress from acoustic features have been or
are being investigated (Fry, 1955; Hughes, Li, and Snow, 1972; Lieberman, 1960;
1967a; Medress, Skinner, and Anderson, 1971). Further studies of the acoustic
correlates of stress in connected speech are needed. In fact, such studies of
stress constitute a major portion of Univac's present and proposed efforts
for ARPA. Experiments with the Rainbow Script, to be described in section 4,
are aimed at determining (among other things) the correlation between parameters
of F0 and energy contours and the perceived and linguistically-predicted
stress patterns. Further studies will be proposed for determining acoustic
correlates of the stress patterns in a variety of specially designed sentences,
so that interfering effects of sentence type, syntactic categories, phonemic
content, and the like can be independently isolated.


Even  when  one  finds  the  most  reliable  cues  to  stress  patterns,  his  job  of  syn¬ 
tax  recognition  is  far  from  done.  He  must  relate  stress  back  to  abstract  syntactic 
structures.  However,  it  is  not  easy  to  relate  stress  patterns  back  to  syntactic 
structures,  for  the  purpose  of  establishing  the  categories  of  constituents  and  the 
sentence  bracketing.  A  few  informal  rules  relating  stress  and  syntactic  categor¬ 
ies  have  been  known  for  some  time.  For  example,  it  is  well  known  among  phonolo- 
gists  that  monosyllabic  function  words  such  as  articles,  prepositions,  anaphoric 
pronouns,  and  conjunctions  are  characterized  as  unstressed  (or  "weakly  stressed") 
(Halle  and  Keyser,  1971,  p.  9).  "Substantives"  like  nouns,  and  most  verbs 
and  adjectives,  are  often  stressed,  particularly  if  they  are  polysyllabic. 

Thus, if a word or syllable is found to be stressed, it is more likely to be a
noun, verb, or adjective than a function word
(sometimes called a "grammatical formative"; cf. Chomsky, 1965, Ch. 2).


Chomsky and Halle (1968) provided explicit rules for relating syntactic
categories (and phonemic content of words) to stress patterns. Several
revisions to their rules have been suggested (Halle and Keyser, 1971; Vanderslice,
1969), but certain essential features are common to all such formulations
of English stress rules. Certain lexical stress rules (which depend upon the
phonemic content and category of a word) dictate which syllables in polysyllabic
words are stressed. When words are grouped into a constituent, two
major stress rules apply. One is the Nuclear Stress Rule (Chomsky and Halle,
1968; Halle and Keyser, 1971), which says that if several lexically-stressed
syllables (vowels) occur in a constituent (called an "α-constituent") not
labelled as a noun (N), verb (V), or adjective (A), then the last of these
stressed syllables gets the primary stress and the others are reduced in
stress level with respect to it. The other major rule, called the Compound
Stress Rule, says that for constituents labelled N, A, or V, the first of
such lexically stressed syllables gets primary stress. The rules apply cyclically,
starting from the smallest constituent within brackets and working out
to the whole sentence. From such rules, there is a strong tie between the
categories of constituents (N, A, V versus α) and the stress pattern. One
might hope then that, knowing the stress pattern, he could work backwards to
the syntactic bracketing and categories. In actual fact, these reverse
(stress-to-syntax) rules would be quite complex, and no one-to-one map from stress to
syntactic structure would be possible (Lea, 1972a, p. 163f). Still, an
attempt could be made to define some such stress-to-syntax rules that may
be appropriate for speech recognition procedures.
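
As an illustration only, the cyclic interplay of the two phrase-level rules can
be sketched in a few lines of Python. The tree encoding (nested `(label, children)`
tuples), the simplification that every word carries exactly one lexically stressed
syllable, and the name `assign_stress` are assumptions made for this sketch. It
ignores the lexical stress rules and unstressed function words, and it is not
offered as Chomsky and Halle's own formulation.

```python
LEXICAL = {"N", "A", "V"}  # categories handled by the Compound Stress Rule


def assign_stress(node):
    """Return [(word, stress)] for a bracketed constituent; 1 = primary stress.

    A node is either a word (str), assumed to carry one lexically stressed
    syllable, or a (label, children) tuple. Rules apply innermost-first.
    """
    if isinstance(node, str):
        return [(node, 1)]
    label, children = node
    words = [pair for child in children for pair in assign_stress(child)]
    primaries = [i for i, (_, s) in enumerate(words) if s == 1]
    if len(primaries) < 2:
        return words  # fewer than two primaries: nothing to rank
    # Compound rule keeps the first primary; nuclear rule keeps the last.
    keep = primaries[0] if label in LEXICAL else primaries[-1]
    # Every other stress in the constituent is weakened by one level.
    return [(w, s if i == keep else s + 1) for i, (w, s) in enumerate(words)]


print(assign_stress(("NP", ["black", "board"])))  # phrase: primary on "board"
print(assign_stress(("N", ["black", "board"])))   # compound: primary on "black"
print(assign_stress(("N", [("N", ["black", "board"]), "eraser"])))
```

Under these assumptions the nested compound receives the pattern 1-3-2, while
the phrase and the simple compound differ only in which word keeps stress level
1. That difference is the kind of category cue a stress-to-syntax rule would
have to exploit.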

2.4  Prosodic  Aids  to  Distinctive  Features  Estimation 

Most  of  the  work  outlined  in  the  previous  section  is  concerned  with  recogni¬ 
zing  syntactic  structure  from  prosodic  patterns,  without  the  use  of  any  seg¬ 
mental  phonemic  information.  But,  a  speech  understanding  system  must  ultimately 
use  distinctive  features  information,  plus  syntactic  parsers  and  semantic  pro¬ 
cesses,  in  the  total  effort  in  sentence  recognition. 

One way in which prosodic information, and resulting syntactic segmentation
and stress pattern analyses, may be used in distinctive features estimation
is as follows. At an early stage in recognition, one detects boundaries
between major syntactic constituents from prosodic features. (A technique for
doing such is described below, in section 3.4.) Then, the highest-stress
syllable(s) within each constituent is (are) located, using reliable prosodic cues to
stress. (Techniques for stressed syllable location, as planned at Univac, are
outlined in section 4.) Some distinctive features are then to be estimated
within these stressed syllables, since the consonants and vowels are expected
to be easier to categorize in stressed syllables than in weakly stressed or
reduced syllables (Hughes, Li, and Snow, 1972). Next, the partial distinctive
features description is matched with generated or stored patterns for possible stressed
syllables or words in the lexicon. Then a guess as to the word content of the
constituent is made, based on the reliable feature information from the stressed
syllable(s), plus other reliable data within the constituent (such as presence of unvoiced
coronal strident fricatives, etc.; cf. Medress, 1969, 1972). If reliable decisions
cannot be made based on such minimal feature information within the constituent,
analyses are then applied to other words or syllables at lower stress values, and
a guess based on the two or more moderately-stressed syllables is made.

Iteration would continue until all syllables are analyzed, if necessary. Each
iterative guess as to constituent identity would be combined with those for
other constituents in the sentence until a satisfactory set of hypotheses for
all constituents yielded the grammatical, meaningful sentence.
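
The control flow of this strategy (most-stressed syllables first, weaker ones
only if needed) might be sketched as follows. Everything concrete here is
hypothetical: the feature names, the toy lexicon, the tuple layout, and the
function name `identify_constituent` are invented for illustration and do not
describe an implemented Univac procedure.

```python
def identify_constituent(syllables, lexicon):
    """Narrow down lexicon candidates, analyzing the most stressed syllables first.

    syllables: list of (position, stress_level, observed_features), where
               stress_level 1 is the highest stress.
    lexicon:   dict mapping a candidate string to a list of feature sets,
               one set per syllable position.
    Returns (remaining_candidates, number_of_syllables_analyzed).
    """
    candidates = sorted(lexicon)
    analyzed = 0
    for position, _, observed in sorted(syllables, key=lambda s: s[1]):
        analyzed += 1
        # Keep candidates whose stored features include everything observed.
        candidates = [
            c for c in candidates
            if position < len(lexicon[c]) and observed <= lexicon[c][position]
        ]
        if len(candidates) <= 1:
            break  # a unique guess (or none) has been reached
    return candidates, analyzed


# Hypothetical two-syllable constituents with made-up feature labels.
toy_lexicon = {
    "the plane": [{"voiced"}, {"stop", "lateral", "front_vowel", "nasal"}],
    "the train": [{"voiced"}, {"stop", "retroflex", "front_vowel", "nasal"}],
    "a plan": [{"vowel"}, {"stop", "lateral", "low_vowel", "nasal"}],
}
observed = [(0, 3, {"voiced"}), (1, 1, {"stop", "lateral"})]
print(identify_constituent(observed, toy_lexicon))
```

In this toy case the stressed syllable alone leaves two candidates, so the
weakly stressed syllable is consulted as well. With less confusable stressed
syllables the loop would stop after one step.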

One  assumption  in  this  approach  is  that  the  phonetic  dependencies  across 
constituent  boundaries  will  be  considerably  less  than  the  interdependencies 
within  a  constituent.  Then  recognition  of  these  substantially  isolatable  consti¬ 
tuents  could  be  attempted,  presumably  more  reliably  than  context-independent 
phonemic  segmentation  and  classification  could  be  achieved.  This  would  consti¬ 
tute  an  assumption  that  linearity  and  invariance  conditions  are  essentially 
satisfied  for  constituents.  Previous  studies  (Lea,  1972a)  showed  that  full- 
clausal  embedded  sentences  and  matrix  sentences  were  separated  by  long  pauses. 
Certainly, phonetic dependencies might be expected to be less likely across
such structural pauses, and one might hope that this will also be true for
smaller, more manageable constituents.

The  essential  ingredient  of  this  type  of  approach  is  that  speech  recogni¬ 
tion  involves  using  prosodic  features  to  make  early  hypotheses  about  syntactic 
structure,  which  then  can  be  used  to  guide  distinctive  features  estimation  pro¬ 
cesses.  Ultimately,  what  is  sought  are  prosodic  cues  to  the  phonological  rules 
which  have  applied  to  surface  syntactic  structures  to  yield  the  observed  acous¬ 
tic  data. 

2.5 Prosodic Aids to Syntactic Parsing

The  methods  described  in  the  previous  section  for  guessing  the  word 
content  of  each  constituent  depend  upon  determination  of  aspects  of  syntactic 
structure  before  the  terminals  ("leaves")  of  a  syntactic  tree  are  determined 
(cf.  Willems,  1972;  Lea,  1972a).  Syntactic  parsers,  on  the  other  hand,  usually 
address  the  problem  of  determining  the  labelled  bracketing  of  a  sentence,  given 
the  terminal  string  as  input  information.  A  major  task  is  to  establish  how 
prosodically-determined syntactic structure may be used to aid syntactic
parsers.  How  can  one  use  the  syntactic  segmentation  and  stressed-syllable- 
location  procedures  to  help  disambiguate  terminal  strings?  Part  of  the  answer 
lies  in  determining  specific  problem  areas  in  parsing  that  could  be  helped  by 
knowledge of stress or boundary information.
For example, structures of the form Noun Phrase - Prepositional Phrase -
Prepositional Phrase are said* to give particular difficulties to syntactic
parsers, in that the associations between the phrases can be quite different
from utterance to utterance. The following partially structured sentences
illustrate a few cases where different structures and semantic associations
have similar NP-PP-PP surface structures:


He couldn't find [the plane]NP [in [[the glidepath] [on his radar]PP]NP]Adv PP

He couldn't find [[the plane]NP [in the glidepath]PP]NP [on his radar]Adv PP

He couldn't find [the plane]NP [in the glidepath]Adv PP [on his first try]Adv PP


2.6  Other  Uses  of  Prosodic  Cues 

There is some possibility of detecting aspects of semantic structure
from prosodic patterns. It is known that emotion and some semantic distinctions
(uncertainty, incompletion, doubt, etc.) affect intonation and other prosodies
(Armstrong and Ward, 1926; Huttar, 1968). Also, grammatical relations such
as coreference, contrast, antecedent-pronoun associations, etc., have been
said by linguists to have regular effects on intonation (Cantrell, 1969).
Recent rules for stress assignment (Bresnan, 1971, 1972) assume that relative
stress levels (determined by the nuclear stress rule) are dictated by the
embedded deep structures of sentences, which are applied through the iterative
syntactic transformational cycle. Since deep structures are closely associated
with semantic interpretations of sentences, stress levels (and thus their
acoustic correlates) might then be relatable to underlying semantic structures.
These claims about semantic cues in prosodic patterns must be instrumentally
investigated.

*According to personal communication with Jerry Wolf, BBN.

3. SYSTEMS FOR EXTRACTING PROSODIC AND DISTINCTIVE FEATURES

In  section  2,  the  motivation  was  given  for  a  program  for  applying  prosodic 
features  to  the  detection  of  stress  and  syntactic  boundaries.  To  use  prosodic 
information  in  cooperation  with  a  partial  distinctive  features  estimation  pro¬ 
cedure,  facilities  are  required  which  extract  prosodic  features  of  fundamental 
frequency,  speech  energy,  and  timing  and  durational  data,  as  well  as  spectral 
data  and  formant  values  versus  time.  In  this  section,  we  describe  such  facilities 
as  implemented  at  Univac,  Defense  Systems  Division  (DSD),  and  their  use  in  a 
computer  program  for  detecting  sentence  boundaries,  constituent  boundaries,  and 
other cues to syntactic structure. These feature-extraction and
structure-detection facilities will be coupled with techniques of stressed-syllable location
to provide acoustic guidelines to syntactic parsers, semantic processors, and
procedures for identifying the lexical identity of distinctive features patterns.

The  speech  analyzing  capabilities  include  methods  for  linear  predictive 
analysis  and  formant  tracking  (section  3.2),  prosodic  features  extraction  (sec¬ 
tion  3.3),  and  syntactic  constituent  boundary  detection  (section  3.4).  These 
analysis  tools  are  operating  within  a  versatile  interactive  speech  research  fac¬ 
ility,  to  be  described  next. 

3.1  Interactive  Speech  Research  Facility 

The  Univac  speech  research  facility  is  a  highly  flexible  and  interactive 
system  designed  especially  for  processing  and  studying  speech.  The  speech  fac¬ 
ility  is  located  in  the  Speech  Communications  Laboratory  adjacent  to  the  Univac 
DSD Engineering Computer Center. In addition to fabrication, testing, and storage
facilities, the laboratory contains a 12' by 12' Industrial Acoustics Corporation
sound isolation room. This room provides an extremely quiet environment
for the speech research terminal. High quality audio tapes with no significant
background noise can be made there for subsequent analysis, or speech may be
entered directly into the computer for study.

A block diagram of the system is shown in Figure 2. With this facility
speech can be appropriately filtered and digitized at up to 20 kHz and stored
on a drum. It can then be played back over a speaker or displayed on a cathode
ray tube (CRT). The digitized acoustic waveform can also be processed by
fast Fourier transform (FFT) or linear prediction programs to obtain short-time
spectral patterns, and by other algorithms to generate fundamental frequency
and energy contours, formant tracks, and other data of interest. All
of the data can then be simultaneously displayed and examined on the CRT.
Interactive control of the system is provided by toggle switches, push buttons,
potentiometers and an alphanumeric display and keyboard.

Figure 2. Block diagram of the Univac interactive speech research facility.

In  order  to  study  long  utterances  and  inter-sentence  effects,  the  research 
facility  can  accept  and  process  up  to  12  seconds  of  speech  at  one  time.  The 
interactive  display  can  be  used  to  simultaneously  examine  the  time  waveform, 
smoothed  spectra,  and  up  to  20  time  functions,  including  formant  tracks,  voicing 
and  fundamental  frequency  contours,  and  various  frequency  delimited  energy  func¬ 
tions  for  a  full  12  seconds  of  speech. 

A  digital  tape  storage  facility  has  been  developed  so  that  a  local  data 
base  can  be  built  up  during  the  course  of  speech  studies.  This  facility  has 
been  designed  to  provide  fast  and  easy  access  to  previously  processed  speech  data 
for  reexamination  and  further  processing.  It  will  complement  and  be  compatible 
with  the  Lincoln  Speech  Data  Facility. 

Finally,  steps  are  being  taken  to  connect  the  speech  research  facility  to 
the  ARPA  network  through  a  Very  Distant  Host  Interface  and  a  50  kilobit  line. 
Software  and  hardware  design  should  be  completed  by  mid-November.  It  is  anti¬ 
cipated  that  the  network  connection  can  be  implemented  in  February,  1973. 

3.2  Linear  Prediction  and  Formant  Tracking 

The  technique  of  speech  analysis  by  linear  prediction  has  been  implemented 
on  the  speech  research  facility.*  This  analysis  technique  produces  very  high 
quality smoothed spectra. A formant tracking program similar to one of Schafer
and Rabiner (1970) has been developed utilizing the smoothed spectra obtained
from  linear  prediction,  thus  making  formant  information  now  available  for  study 
and  use. 


*In the Univac implementation of linear prediction, the assistance of John Makhoul
(Makhoul and Wolf, 1972) has been appreciated.

In the experiments with the Rainbow Script (described in section 4), the
acoustic waveform was first prefiltered to 4782 Hz with a seventh order elliptic
function (Cauer) low pass filter provided by Lincoln Laboratory. The
following parameter values were then used in the analysis: 10 kHz sampling
rate (and thus a 5 kHz frequency analyzing bandwidth), 25.6 msec Hamming weighted
spectral analysis window with a 10 msec advance, and 12 predictor coefficients.
In addition, the speech signal was processed without pre-emphasis. This permits
the evaluation of the linear predictor normalized error as a potential voicing
function. It has been shown (Makhoul and Wolf, 1972) that the normalized error
is not a good indicator of voicing if the speech has been pre-emphasized.
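
To make the normalized error concrete, the following sketch computes it for a
single 256-sample frame (25.6 msec at 10 kHz) with 12 predictor coefficients,
using the textbook autocorrelation method and the Levinson-Durbin recursion.
It is a minimal illustration under those assumptions, not the Univac program.
The function name, the silent-frame convention, and the two synthetic test
signals are invented for the example.

```python
import math
import random

SAMPLE_RATE = 10_000  # Hz
FRAME_LEN = 256       # 25.6 msec at 10 kHz


def lpc_normalized_error(frame, order=12):
    """Hamming-window a frame and return the prediction error E_p / R(0)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    r = [sum(windowed[i] * windowed[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    if r[0] == 0.0:
        return 1.0  # silent frame: nothing is predicted (sketch convention)
    a = [0.0] * (order + 1)  # predictor coefficients a[1..order]
    error = r[0]
    for m in range(1, order + 1):
        if error <= 1e-12 * r[0]:
            break  # numerically perfect prediction; stop the recursion
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / error
        prev = a[:]
        a[m] = k
        for j in range(1, m):
            a[j] = prev[j] - k * prev[m - j]
        error *= 1.0 - k * k
    return error / r[0]


# Synthetic stand-ins: a 200 Hz tone (periodic) and seeded white noise.
tone = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE) for i in range(FRAME_LEN)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]
tone_error = lpc_normalized_error(tone)
noise_error = lpc_normalized_error(noise)
print(tone_error < noise_error)
```

A strongly periodic, low-frequency signal is predicted well and gives a
normalized error near zero, whereas a noise-like frame stays much closer to
one. That contrast is what makes the quantity a candidate voicing function.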

3.3  Prosodic  Features  Extraction 

Fundamental frequency, energy, quality, and duration are useful prosodic
features (Lea, 1972a; Medress and Skinner, 1972; Medress, Skinner, and Anderson,
1971). In the Rainbow Script experiments (discussed in section 4), various
tabular and graphical time functions were extracted which are indicative of
these prosodic features.

By autocorrelating the center-clipped acoustic time waveform (Sondhi, 1968),
a fundamental frequency measure was obtained every 10 msec for a 51.2 msec time
segment over a range of 70 to 400 Hz. Fundamental frequency in Hertz was also
converted to eighth-tones, yielding a log frequency scale for relative measure.
Alternative methods for fundamental frequency determination (cepstral analysis,
Noll, 1967; linear prediction, Makhoul, 1972) have been implemented on the
research system but, at present, do not perform as accurately as autocorrelating
the center-clipped time waveform.
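
A minimal sketch of center-clipped autocorrelation pitch estimation is given
below for one 51.2 msec segment (512 samples at 10 kHz), searching lags that
correspond to 70 to 400 Hz. The clipping level (30% of the segment peak), the
voicing threshold, and the function names are assumptions chosen for
illustration. The report does not give the values used in the Univac program.

```python
import math

SAMPLE_RATE = 10_000  # Hz
SEGMENT_LEN = 512     # 51.2 msec at 10 kHz


def center_clip(samples, fraction=0.3):
    """Zero everything within +/- fraction * peak; shift the rest toward zero."""
    level = fraction * max(abs(x) for x in samples)
    return [x - level if x > level else x + level if x < -level else 0.0
            for x in samples]


def estimate_f0(segment, f_min=70.0, f_max=400.0, voicing_threshold=0.3):
    """Return F0 in Hz from the autocorrelation peak, or None if no clear peak."""
    if not any(segment):
        return None  # all-zero segment
    clipped = center_clip(segment)
    n = len(clipped)
    r0 = sum(x * x for x in clipped)
    best_lag, best_r = None, 0.0
    for lag in range(int(SAMPLE_RATE / f_max), int(SAMPLE_RATE / f_min) + 1):
        r = sum(clipped[i] * clipped[i + lag] for i in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None or best_r < voicing_threshold * r0:
        return None  # treat as unvoiced (threshold is an assumption)
    return SAMPLE_RATE / best_lag


# Synthetic voiced-like test signal: 120 Hz fundamental plus two harmonics.
tone = [math.sin(2 * math.pi * 120 * i / SAMPLE_RATE)
        + 0.5 * math.sin(2 * math.pi * 240 * i / SAMPLE_RATE)
        + 0.3 * math.sin(2 * math.pi * 360 * i / SAMPLE_RATE)
        for i in range(SEGMENT_LEN)]
f0_estimate = estimate_f0(tone)
print(f0_estimate)
```

Because the lag is an integer number of samples, the estimate is quantized.
Near 120 Hz at a 10 kHz sampling rate, adjacent lags differ by roughly 1.5 Hz.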

Various frequency-dependent energy functions were computed as part of the
Rainbow Script experiments. One time function which reflects total energy in a
25.6 msec window every 10 msec was computed from the sum of the squares of the
time waveform values (Blackman and Tukey, 1958). (This sum is the first
autocorrelation term, an intermediate parameter of linear prediction spectral
analysis.) Other energy measures obtained in the frequency domain include: (a) a
low frequency, sonorant energy function, computed by summing the squares of the
smoothed spectral magnitudes from 0 to 3000 Hz and then converting the sum to
dB, and (b) a broadband energy function, computed similarly by summing the
squares of the smoothed spectral magnitudes from 0 to 5000 Hz and converting the
sum to dB.
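
The two kinds of energy measure can be sketched as follows, assuming a 10 kHz
sampling rate (so that 25.6 msec is 256 samples and 10 msec is 100 samples).
The dB reference of 1.0, the handling of silent frames, and the function names
are assumptions of this sketch. The report does not specify them.

```python
import math


def frame_energy_db(samples, window=256, advance=100):
    """Total energy (sum of squares) per frame, in dB; silent frames give None."""
    energies = []
    for start in range(0, len(samples) - window + 1, advance):
        total = sum(x * x for x in samples[start:start + window])
        energies.append(10.0 * math.log10(total) if total > 0.0 else None)
    return energies


def band_energy_db(magnitudes, bin_hz, f_low, f_high):
    """Sum squared spectral magnitudes for bins within [f_low, f_high], in dB.

    magnitudes[k] is the (smoothed) spectral magnitude at frequency k * bin_hz.
    """
    total = sum(m * m for k, m in enumerate(magnitudes)
                if f_low <= k * bin_hz <= f_high)
    return 10.0 * math.log10(total) if total > 0.0 else None


# Usage with placeholder data: a flat spectrum sampled from 0 to 5000 Hz.
flat_spectrum = [1.0] * 129
bin_hz = 5000 / 128
sonorant_db = band_energy_db(flat_spectrum, bin_hz, 0, 3000)
broadband_db = band_energy_db(flat_spectrum, bin_hz, 0, 5000)
print(sonorant_db, broadband_db)
```

The sonorant and broadband functions differ only in the upper frequency limit,
so their difference gives a rough indication of how much energy lies above
3000 Hz in a given frame.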

Other outputs from the Rainbow Script experiments may prove useful
as quality and durational indicators. These include digital spectrograms and
formant  tracks  (both  tabular  and  graphical),  from  which  vowel  reduction  can  be 
estimated.  In  addition,  correlation  (Hogg  and  Craig,  1965)  and  spectral  deriva¬ 
tive  (Medress,  1969)  time  functions  were  computed.  These  functions  indicate 
some  vowel  boundaries  (and  thus  some  phonetic  durations).  For  example,  at  a 
vowel-obstruent  boundary,  a  definite  peak  will  occur  in  the  spectral  derivative 
function  and  a  prominent  valley  will  appear  in  the  correlation  function. 

Figure 3 is a typical graph of total energy in dB (as computed in the
time domain) and fundamental frequency (in eighth-tones) for speaker ASH
reciting "they act like a prism", as extracted from his reading of the connected
text of the Rainbow Script. A value is recorded for each function every 10 msec
from time 3160 to 4880 msec. Energy is shown by the symbol "B" and fundamental
frequency is indicated by the symbol "0". The tabular data at the top of the
graph are fundamental frequency in Hertz, broadband energy in dB and fundamental
frequency in eighth-tones. For example, at time 3750 msec, the energy graph is at 68
dB, while the fundamental frequency function is at 71 eighth-tones. Note from the
tabular data at time 3750 that 71 eighth-tones corresponds to 192 Hertz.
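
An eighth-tone is one eighth of a whole tone, so there are 48 eighth-tones per
octave. A conversion between Hertz and eighth-tones can therefore be sketched
as below. The reference frequency that defines zero on the scale is not stated
in the report, so it is left as an explicit parameter. The function names are
likewise assumptions of this sketch.

```python
import math

EIGHTH_TONES_PER_OCTAVE = 48  # 6 whole tones per octave, 8 eighth-tones each


def hz_to_eighth_tones(f_hz, f_ref_hz):
    """Number of eighth-tones by which f_hz lies above the reference f_ref_hz."""
    return EIGHTH_TONES_PER_OCTAVE * math.log2(f_hz / f_ref_hz)


def eighth_tones_to_hz(tones, f_ref_hz):
    """Inverse conversion: frequency lying `tones` eighth-tones above f_ref_hz."""
    return f_ref_hz * 2.0 ** (tones / EIGHTH_TONES_PER_OCTAVE)


print(hz_to_eighth_tones(140.0, 70.0))  # one octave above the reference
```

If the scale were defined this way, the pairing of 192 Hz with 71 eighth-tones
quoted above would imply a reference of roughly 69 Hz. That figure is an
inference from the single data point, not a documented value.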

Figure 4 is a photograph of the interactive graphical display. With appropriate
potentiometer control, that portion of the acoustic time waveform which
corresponds to the data of Figure 3 ("they act like a prism", speaker ASH) has
been selected for display and is shown at the bottom. The number at the ordinate
shows the time waveform display to begin at 3154.7 msec and include
the time data to 4879.9 msec, the time indicated at the base of the time waveform
cursor. At the top of the display is a spectral plot (relative amplitude in dB
versus frequency) of a vowel portion of the utterance. The particular spectral
frame that is displayed is selected with a potentiometer, and thus the short time
spectral pattern can be examined throughout the entire utterance. Another
potentiometer controls the position of a cursor used to examine the spectral frame being
displayed. In this picture, the cursor is positioned at a spectral peak which is
centered at 1718 Hz and has a relative amplitude of 49 dB. The middle of the
display contains the time functions of fundamental frequency in Hertz and
broadband energy in dB. The position of the time function cursors is selected
with a potentiometer and, for time 3750 msec, the cursors show values of 192 Hz
for fundamental frequency and 68 dB for broadband energy.

Figure 3. A graph of broadband energy (B) and fundamental frequency (0) versus time for "they act like a
prism" by talker ASH. Time goes from left to right in milliseconds, as shown by the set of numbers at the
bottom. Immediately above the graph is a tabular listing of fundamental frequency in Hertz, then energy
in dB, and at the top, fundamental frequency in eighth-tones (refer to the discussion in section 3.3).

Figure 4. Photograph of the interactive graphical display for the utterance
"they act like a prism" by talker ASH. For this example, the display shows
(from the bottom up) the time wave, energy contour, fundamental frequency
contour and one spectral cross section (refer to the discussion in section 3.3).

Figures 3 and 4 thus illustrate the types of displays that can be obtained
for speech studies such as the experiments to be described in section 4.

3.4 Syntactic Boundary Detection

The prosodic patterns obtained from the interactive research facility can
also be used as inputs to programs for detecting aspects of syntactic structure.
Recent research (Lea, 1971, 1972a, b) has demonstrated that recognition of
aspects of syntactic structure can be accomplished, in part, by using fundamental
frequency (F0) contours to detect boundaries between major syntactic units.
While for decades linguists have claimed that intonation and stress may indicate
the immediate constituent structure of English (Gleason, 1961; Lieberman, 1967a, b;
Trager and Smith, 1951; Wells, 1947), this recent work has explicitly demonstrated
success in computer detection of syntactic structure.

Fundamental frequency contours were obtained for over 500 seconds of speech,
including short stories, newscasts, weather reports, and excerpts from
conversations, spoken by nine talkers. A decrease (of about 7% or more) in F0 usually
occurred at the end of each major syntactic constituent, and an increase (of
about 7% or more) in F0 occurred near the beginning of the following constituent.

Figure 5 illustrates the F0 contour of a typical sentence taken from a weather
report. Fall-rise "valleys" (marked by vertical dotted lines) accompany
the syntactically predicted boundaries (marked by arrows labelled with the
categories of surrounding constituents). A computer program, based on the
regular occurrence of F0 valleys at constituent boundaries, correctly detected over
80% of all syntactically predicted boundaries.
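
The fall-rise criterion can be illustrated with a short Python sketch. It is not the program reported here. The function name, the frame representation (one F0 value in Hz per frame, with 0 standing for an unvoiced frame), and the bookkeeping of peaks and valley bottoms are all assumptions made for illustration. Only the idea of requiring a minimum percentage fall followed by a minimum percentage rise is taken from the text.

```python
def find_f0_valleys(f0, min_change=0.07):
    """Return frame indices of fall-rise valley bottoms in an F0 contour.

    f0: sequence of F0 values in Hz, one per frame. A value of 0 marks an
        unvoiced frame (an assumed convention) and is skipped.
    min_change: minimum fractional fall and rise (0.07 means 7 percent).
    """
    valleys = []
    peak = None        # highest F0 seen since the last reported valley
    bottom = None      # lowest F0 seen after a sufficient fall from the peak
    bottom_idx = None
    for i, value in enumerate(f0):
        if value <= 0:
            continue  # no F0 estimate in unvoiced frames
        if peak is None:
            peak = value
        elif bottom is None:
            if value > peak:
                peak = value
            elif value <= peak * (1 - min_change):
                bottom, bottom_idx = value, i  # fall is large enough
        elif value < bottom:
            bottom, bottom_idx = value, i      # valley is still deepening
        elif value >= bottom * (1 + min_change):
            valleys.append(bottom_idx)         # rise completes the valley
            peak, bottom, bottom_idx = value, None, None
    return valleys
```

With the default threshold, small wiggles such as `[100, 95, 100, 95, 101]` yield no valleys, while a lower threshold (for example `min_change=0.04`) reports both dips. This mirrors the trade-off between missed boundaries and false alarms that the threshold controls.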

Of the less than 20% of "missing" boundaries, about half were due to
predicted boundaries between noun phrases and following verbals (auxiliary verbs
or main verbs). There is considerable evidence (e.g., contractions like "I've",
etc.) that such boundaries between noun phrases and verbals would not be expected
in phonological structure (Lea, 1972a). Thus, when boundaries between noun phrases
and verbals are neglected, about 90% of all other boundaries are detected.


Besides this regular acoustic manifestation of boundaries between major
syntactic constituents, some boundaries between minor constituents (e.g.,
between an adjective and a following noun) were also detected by the fall-rise
patterns in F0.

Detecting such syntactic structure from F0 contours is complicated by the
fact that, at consonant-vowel (and vowel-consonant) boundaries, variations in
F0 occur which may be confused with the changes marking syntactic boundaries.
False (syntactically unrelated) boundary detections resulted from F0 variations
at these boundaries between vowels and consonants, but most such false alarms
could be eliminated by setting a minimum percent variation (about 10%) in F0
for a boundary detection. A detailed study of F0 variations at phonetic
boundaries (Lea, 1972a, Ch. 4) clearly indicated that such phonetically
dictated changes in F0 would rarely exceed about 10%. Studies were also
conducted on the effects of stress patterns on F0 variations at consonant-vowel
boundaries.

Sentence boundaries (such as that marked S. - S. in Figure 5) were always
accompanied by fall-rise F0 contours. In fact, the rise in F0 (around 90% change)
after a sentence boundary was substantially larger than the usual rises (about
40% or less) after non-sentential constituent boundaries. In addition, sentence
boundaries were usually (in over 90% of all cases) accompanied by long
(35 centisecond) stretches of unvoicing. Here "sentence boundaries" refer to both
boundaries between matrix (unembedded) sentences and boundaries between embedded
full-clausal sentences (as in Figure 5).
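
As a rough illustration of how these two cues might be combined, the following Python sketch labels a detected valley as a sentence boundary or an ordinary constituent boundary. The function name, the arguments, and the decision rule are hypothetical. In particular, the 0.65 rise threshold is simply an assumed value lying between the 40% and 90% figures quoted above, and it is not a value reported in this study.

```python
def classify_boundary(f0_bottom, f0_next_peak, unvoiced_cs,
                      rise_threshold=0.65, unvoiced_threshold_cs=35):
    """Label a detected F0 valley as 'sentence' or 'constituent'.

    f0_bottom: F0 in Hz at the valley bottom.
    f0_next_peak: F0 in Hz at the peak following the valley.
    unvoiced_cs: length of the unvoiced stretch at the valley, in centiseconds.
    rise_threshold: assumed fractional rise separating the two classes.
    unvoiced_threshold_cs: assumed minimum unvoiced stretch for a sentence boundary.
    """
    rise = (f0_next_peak - f0_bottom) / f0_bottom
    # Require both cues; this is a deliberately conservative, assumed rule.
    if rise >= rise_threshold and unvoiced_cs >= unvoiced_threshold_cs:
        return "sentence"
    return "constituent"
```

Because long unvoicing accompanied sentence boundaries in most but not all cases, requiring both cues would miss some sentence boundaries. A practical rule would have to be tuned on measured data.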

A preliminary study was conducted of the effects of specific constituent
categories (noun phrase, verb, prepositional phrase, etc.) on boundary detection.
The lack of regular boundary marking between noun phrases and following verbals
has already been noted. On the other hand, around 95% of all boundaries before
prepositional phrases are detected by F0 fall-rise valleys. This might be
especially useful, since NP-PP-PP sequences are known to give particular
difficulties to syntactic parsers. Also, coordinate noun phrases or coordinate
adjectives were always accompanied by F0 valleys between the conjuncts. Sizes
of F0 valleys were also studied for the various syntactic categories.

The constituent boundary detection program developed by Lea at Purdue has
been implemented as a FORTRAN program at the Univac DSD Speech Communications
Laboratory. The experiments with F0-detected constituent boundaries will be
extended to other texts and talkers (see section 4). To further refine the
studies of how syntactic category affects F0 contours, a controlled experiment
is also being planned, in which position in the sentence, constituent category,
lexical content, and other factors can be varied separately in designed texts.
Syntactic contrasts, such as compound structures versus nuclear structures, or
simple constituents (NP: John) versus coordinate structures (NP and NP: John
and Bill), etc., would be placed at various points in the structure of a sentence
to isolate the effects that syntactic categories and bracketing may have.


The previous studies of boundary detection have been confined to declarative
sentences in spoken texts and to declarative and (a few) imperative
utterances excerpted from interviews. Since man-machine interactions for
information retrieval, or for other tasks discussed by ARPA Speech Understanding
Research contractors, will undoubtedly involve commands and questions,
investigations of boundary detection in questions and commands would be appropriate.
This may introduce the need for refined boundary detection techniques to handle
other types of sentences.

Such studies also can be extended by investigating what syntactic information
can be extracted from speech intensity contours and phonetic durations
(Willems, 1972). In particular, intensity sometimes drops at boundaries, much as
F0 does (though sometimes more dramatically). Intensity contours can also give
more precise specifications of silent pauses than the mere absence of F0 can.
Phonetic durations are lengthened in phrase-final positions (Allen, 1968; Barik,
1969; Barnwell, 1970; Boomer, 1965; Goldman-Eisler, 1958, 1961), but the use of
durations requires phonetic segmentation processes.

To date, studies have not been concerned with exact boundary location, only
with boundary detection. When weakly stressed or reduced syllables begin a
constituent, the F0 valley bottom may occur within that weak beginning of the
constituent. When a previous constituent exhibits a "Tune II" intonation rise
(Armstrong and Ward, 1926) at its end, the F0 valley bottom may occur within
the last syllables of the prior constituent. Refinements might be incorporated
to more exactly locate the boundary within the region of the F0 valley.

The refined procedures for syntactic segmentation must be integrated with
stressed-syllable location procedures. To develop clear indications of the
acoustic correlates of stress in connected speech, and to test refined
segmentation procedures, the experiments described in section 4 have been devised.

4. EXPERIMENTS ON PROSODIC PATTERNS IN THE RAINBOW SCRIPT

The philosophy of speech recognition outlined in section 2.4 suggests the
need for methods of demarcating constituents, finding stressed syllables, and
doing a partial distinctive features analysis on the reliable data within the
stressed syllables. A method for demarcating constituents was outlined in
section 3.4. Its implementation and refinement at Univac must be tested with
extensive speech data. Methods for finding stressed syllables, and for refining
partial distinctive feature estimation techniques, will be developed. They depend
upon first finding reliable acoustic correlates of stressed syllables.

Stress is an abstract quantity usually considered to be associated with
a speaker's total physical effort in speech production or with a listener's
perception of "prominent" syllables. Extensive work has been done on acoustic
correlates of stress (cf. reviews by Lehiste, 1970, and Medress and Skinner,
1972), and on physiological correlates of stressed syllable production (cf.
review by Lieberman, 1967). On the other hand, linguists have devised
phonological rules that purport to predict the stress levels and vowel reductions in
spoken English (Chomsky and Halle, 1968; Halle and Keyser, 1971). Yet this
work has not answered vital questions about how to automatically locate stressed
syllables in connected speech.

In this section, we will outline experiments which are designed to determine
what acoustic features correlate well with the stress levels and syntactic
boundaries in a connected speech text. Based on the experimental results to be
obtained, computer-implementable techniques for stressed syllable location (and
refined constituent boundary detection) will later be developed.

A three-fold experimental effort is involved in this research, with these
major data-gathering tasks:

1. Syntactic analysis of the sentences in the speech text, followed by
application of appropriate stress rules and vowel reduction rules, to
predict stress levels and vowel reductions in the script;

2. Presentation of the script, spoken by six talkers (4 male, 2 female),
to four listeners (individually), for their judgments as to which
syllables are stressed, reduced, or unstressed; and

3. Processing of the spoken scripts by the interactive speech research
facility (see section 3), to obtain data on the F0 contours, intensity
contours, and other acoustic features that may correlate with syllable
stress and reduction.


Following such data-gathering tasks, there come the extensive tasks of: relating
the linguistic predictions of stress to the perceptual results; comparing the
perceptions of the various listeners, and determining variations from talker to
talker; and relating the various acoustic features to the perceptually established
stress patterns and to the linguistically predicted stress patterns. Conclusions
must then be drawn concerning: the adequacy of the linguistic rules in predicting
listener judgments; the regularity of stress judgments, from talker to talker and
from listener to listener; and the best acoustic correlates of stress and
reduction. Methods for automatically predicting stress from acoustic correlates must
be considered.
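
One simple way to compare the perceptions of the various listeners is to tabulate pairwise agreement over the syllable-by-syllable labels and to derive a majority label for each syllable. The Python sketch below is only an illustration of that bookkeeping. The function names, the data layout, and the listener identifiers in the usage note are hypothetical, and the report does not state which agreement measure is actually used.

```python
from collections import Counter
from itertools import combinations

LEVELS = ("stressed", "unstressed", "reduced")  # the three categories used here


def pairwise_agreement(judgments):
    """Fraction of syllables on which each pair of listeners agrees.

    judgments: dict mapping a listener name to a list of labels from LEVELS,
    one label per syllable, with all lists the same length.
    """
    result = {}
    for a, b in combinations(sorted(judgments), 2):
        pairs = list(zip(judgments[a], judgments[b]))
        same = sum(1 for x, y in pairs if x == y)
        result[(a, b)] = same / len(pairs)
    return result


def majority_labels(judgments):
    """Label chosen by more than half of the listeners, or None if there is none."""
    labels = []
    for column in zip(*judgments.values()):
        label, count = Counter(column).most_common(1)[0]
        labels.append(label if count > len(column) / 2 else None)
    return labels
```

For example, with invented toy data for three hypothetical listeners `L1`, `L2`, and `L3` judging four syllables, `pairwise_agreement` returns one agreement fraction per listener pair, and `majority_labels` returns the consensus pattern against which linguistic predictions or acoustic features could be compared.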

In section 4.1 the overall design of the experiments is discussed and
related to previous studies of stress. The methods of syntactic analysis, and the
predictions of stress that will result from linguistic stress rules, are
discussed in section 4.2. Section 4.3 describes the study of listeners' perceptions
of stress patterns in the spoken text. Acoustic correlates of perceived or
predicted stress are discussed in section 4.5.


4.1  Selection  of  Experimental  Conditions 

The  text  chosen  for  the  initial  studies  of  stress  analysis  and  boundary 
detection  is  the  first  paragraph  of  the  "Rainbow  Passage"  introduced  by 
Grant  Fairbanks  (1940),  and  used  extensively  in  speech  research.  The  text, 
hereinafter referred to as the "Rainbow Script", reads as follows:


"When the sunlight strikes raindrops in the air, they act
like a prism and form a rainbow. The rainbow is a division
of white light into many beautiful colors. These
take the shape of a long round arch, with its path high
above, and its two ends apparently beyond the horizon.
There is, according to legend, a boiling pot of gold at
one end. People look, but no one ever finds it. When a
man looks for something beyond his reach, his friends say
he is looking for the pot of gold at the end of the
rainbow."

Several criticisms of previous work motivate the present experiments with
the Rainbow Script. These criticisms will be considered in the following
numbered paragraphs. Numbering of the criticisms here corresponds with the
similarly numbered advantages of the present experiment, to be tabulated at
the end of this section.

(1) Some studies of acoustic correlates of stress (e.g., Medress, Skinner,
and Anderson, 1971) have been based on intuitive knowledge of stress patterns,
while others (e.g., Fry, 1958; Bolinger, 1958; Morton and Jassem, 1965) have
attempted to determine how listeners' judgments of stress vary as certain
acoustic features are varied. These studies of performance may be contrasted with
linguists' theoretical models (i.e., sets of stress-assignment rules), which are
usually based on the competence of an ideal speaker-listener, distinguished
from performance by the exclusion of memory limitations, speaker and listener
differences, etc. The present study attempts to associate linguistic predictions,
listener judgments, and acoustic features. From such a three-way association we
may learn how acoustic features correlate with both ideal and actual
listener-assigned stresses, and we may see something about speaker and listener
variability from a theoretical norm.

(2) Many studies of acoustic correlates of stress have been based on
presenting synthesized speech to listeners. Such studies take advantage of the
ability  to  separately  control  acoustic  features  of  synthesized  speech,  for 
testing  how  acoustic  variations  correlate  with  listeners'  perceptions  of  stress 
(Fry,  1955,  1958;  Bolinger,  1958;  Morton  and  Jassem,  1965;  Mattingly,  1966;  etc.). 
However,  one  must  be  very  cautious  about  simply  extrapolating  from  results  with 
unnatural  synthetic  speech  to  corresponding  claims  about  natural  speech. 

Similarly, studies of listeners' perceptions of structural boundaries and
other prosodic structure have been attempted with speech data distorted so as to
remove or mask all or most of the segmental phonetic information, leaving only
prosodic information for the listener (O'Malley and Peterson, 1966; Blesser, 1969;
O'Malley, 1972; cf. also Lummis, 1971). Techniques used include inverting the
frequency spectrum, masking with noise, and low-pass filtering. While such
distortions may, with some difficulty, substantially isolate prosodic information
from segmental data and lexical context information, the listeners' behavior
with such distorted speech may or may not correspond well with their responses
to prosodic patterns in natural speech. Studies with natural speech would
ultimately seem appropriate.

(3) A recurrent problem in stress studies is the indiscriminate confusion
of stress and emphasis, and the loose concept of exactly what stress is. The
most blatant violations of "knowing what you're looking for before you seek its
acoustic correlates" occur when researchers ask for listeners' judgments about
stress while they take a fixed unambiguous utterance and increase F0, intensity,
vowel durations, or such within one vowel or syllable. Thus, for example,
Bolinger (1958) worked with individual utterances such as:

Wouldn't  it  be  easier  to  wait? 

Break  both  apart. 

Many  are  taught  to  breathe  through  the  nose. 

But  would  many  return? 

Alexander's  an  intelligent  conversationalist. 

and varied acoustic features, asking listeners to judge whether easier or wait
was more stressed, etc. Lieberman (1967, Ch. 4) studied the contrast between
such sentences as

Joe ate his soup.

Joe ate his soup.

Joe ate his soup.

where the underlined word is given special prominence or emphasis. Such studies
deal with what properly may be called special emphasis, in that words or syllables
are assigned a prominence or force of utterance which is non-normative, marking a
semantic distinction from other sentences which do not exhibit these special
effects. In such utterances, prominence is specifically intended to mark a
distinction from the norm, or semantically neutral utterance, with the same word
content.

In contrast to such emphatic prominence, stress is used in this report to
refer  to  the  relative  prominence  of  syllables  in  the  normative  utterance  of  a 
sentence.  This  normative  stress  pattern,  which  we  might  also  term  linguistic 
stress,  is  what  should  be  predicted  by  linguistic  stress  assignment  rules,  such 
as  those  of  Halle  and  Keyser  (1971).  The  performance  of  an  individual  speaker, 
or  the  judgments  of  an  individual  listener,  will  approximate  this  norm,  but 
will  be  influenced  by  extralinguistic  factors. 

Many studies (Fry, 1955, 1958; Bolinger, 1958; Lieberman, 1960; Mol and
Uhlenbeck, 1956; Morton and Jassem, 1965; Lea, 1972, Ch. 5) have investigated
the acoustic correlates of stress contrasts in isolated minimal pairs such as
noun-verb pairs (permit - permit, etc.). The semantic and syntactic distinctions
in such pairs are marked by stress contrasts, not phonemic sequence differences.
While such studies help isolate stress effects from other acoustic factors, they
involve a special case of stress effects which may or may not give acoustic cues
which correspond with those given in connected sentences or even in other
multisyllabic non-minimum-pair words or phrases.

(4) While linguists (Trager and Smith, 1951; Chomsky and Halle, 1968) have
often worked with four or more levels of stress, plus a category of reduced vowels
or syllables, some tests show that trained listeners can reliably and consistently
judge no more than two stress levels, plus identifying reduced syllables (Lieberman,
1964; Lea and Li, IN PREPARATION). Three levels of stressedness will be used in
the present perceptual or acoustic studies: stressed, unstressed, and reduced.*

(5) The present study is concerned with sentence stress, not word stress.
While word stress (also called "lexical stress") is one form of input information
into the rules for assigning sentence stress, the overall sentence stress pattern
is also a function of syntactic bracketing and syntactic categories. Few
experimental studies have been concerned with the stress patterns throughout sentences.
Previous perceptual tests with sentence material have involved deciding which is
the most stressed syllable, whether a specific single syllable is or is not
stressed, or which of two syllables is more stressed. The present experiments
extend studies to all syllables in the sentences.

* These stress categories are defined in section 4.3. An exception, noted there, is
where one listener (WAL) marked four levels of perceived stress for one repetition
of the perceptual experiment.

(6) (7) (8) Isolated sentences tend to have different intonation contours,
and perhaps different stress patterns, from those in sentences in the context of
discourse. The Rainbow Script used in the present experiments is a well-known
semantically connected text (a paragraph) with substantially neutral content,
demanding few (if any) cases of special emphasis, but with a variety of
declarative structures included. Compound nouns, nuclear phrase structures,
parenthetical-like phrases, conjuncts, and complex sentence embeddings are all exhibited.
Interrogatives which exhibit rising F0 in sentence-final syllables are avoided.
This is one confusion factor which questions like Bolinger's (illustrated above)
introduced into previous studies. Declaratives, imperatives, yes-no questions, and
WH-word questions should be distinguished and handled separately in stress
studies.

(10) Intonation of various sentence types has the most pronounced effect on
the last stressed syllable of a sentence (or clause) and any subsequent unstressed
syllables (cf. Lehto, 1971; Armstrong and Ward, 1929). Figure 3 in section 3.3
(p. 25) illustrates several effects of clause-final tonalization positions. In
the clause "they act like a prism" of Figure 3, act and 'pris' of prism are both
stressed. However, the clause-final word prism is much longer, and shows a
characteristic falling F0 contour, in contrast to a shorter and rising-F0 syllable act.
High peak F0 would suggest that act was stressed, while long duration of prism
may either be attributed to stress or the clause-final position (which lengthens
both stressed and unstressed syllables; Mattingly, 1966). Consequently,
perceptual and acoustic results in these clause-final positions (called tonalization
positions) may be different from those in the rest of the sentence. The analysis of
Rainbow Script data should involve a separate consideration of acoustic features of
stress in the tonalization and non-final (so-called intonation "body") positions.
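
The separation of tonalization positions from body positions can be sketched as a simple labeling pass over each clause. In the Python sketch below, a syllable is placed in the tonalization region if it is the last stressed syllable of its clause or follows that syllable. The function name and the data layout are assumptions for illustration. In the usage note, the stress marks for act and 'pris' follow the text above, while the marks for the remaining syllables are assumed.

```python
def label_positions(clauses):
    """Tag each syllable as 'tonalization' or 'body'.

    clauses: list of clauses, each a list of (syllable, is_stressed) pairs.
    The tonalization region of a clause runs from its last stressed syllable
    to the end of the clause; everything earlier is the intonation body.
    """
    labeled = []
    for clause in clauses:
        stressed = [i for i, (_, is_stressed) in enumerate(clause) if is_stressed]
        # With no stressed syllable, the whole clause is treated as body (assumed).
        start = stressed[-1] if stressed else len(clause)
        for i, (syllable, is_stressed) in enumerate(clause):
            region = "tonalization" if i >= start else "body"
            labeled.append((syllable, is_stressed, region))
    return labeled
```

Applied to the clause "they act like a prism", coded as `[("they", False), ("act", True), ("like", False), ("a", False), ("pris", True), ("m", False)]`, the stressed syllable act falls in the body, while 'pris' and the final syllable of prism fall in the tonalization region, so their acoustic measurements can be pooled and analyzed separately.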

(11) (12) Another significant aspect of the present study is the set of
acoustic features to be monitored. Several different parameters of F0 contours,
both within vowels and within consonants, are to be compared with predicted and
perceived stress patterns. Similar parameters of energy contours are also to
be studied, for both broadband energy and low-frequency energy. Durational
measures, while recognized to be closely correlated with stress, are
de-emphasized in this study. The reason is that the required phonetic boundaries
and vowel and consonant durations are difficult to automatically determine.
The potential for machine extraction of such acoustic cues is given special
consideration here, since the intention is to develop stress cues suitable
for application in speech recognition systems.

(13) In evaluating acoustic correlates of stress, the influences of phonetic
environment (vowel features and pre- and post-vocalic consonants) on F0
contours, energy contours, and durations must be considered. Recent research
(Lea, 1972a, Ch. 5) showed that, in isolated words, a falling F0 contour in
the beginning of a vowel may indicate either that the preceding consonant was
unvoiced or that the syllable was unstressed. A rising contour may indicate a
preceding voiced consonant, a word-initial vowel, or stress on the syllable. Peak
F0 and amplitude in a vowel are affected by whether the surrounding consonants
are voiced or unvoiced, and by whether the tongue is high or low in the oral
cavity during the vowel (Lehiste, 1970; Lea, 1972, Ch. 4). Vowel durations
are lengthened before voiced consonants (House and Fairbanks, 1951; House, 1960;
Lehiste, 1970). The study of acoustic correlates in the Rainbow Script will
include consideration of such phonetic effects.

(14) The perception tests to be reported in section 4.3 involve several
listeners and several talkers chosen to be representative of a wider population
(essentially, those with General American dialect). One listener repeated the
perception test several times to determine how consistent stress perceptions
are from time to time.

(9) (15) Some studies of listeners' perceptions of stress (e.g., Bolinger,
1958) have required the listener to simply record whether or not a given
syllable is stressed, or, alternatively, which of two syllables is more stressed.
To get judgments for all the syllables in a sentence, the task
would have to be repeated, once again with each other syllable as the one in
question. An alternative procedure is to have the listener listen repeatedly
to the same recorded speech until he is able to assign a judgment to each
syllable. This is the technique used in the perceptual tests of the present
study. The method of repeatedly listening to each utterance has apparently
not been used before in stress perception studies, and its relative merits are
not established. Two similar techniques were employed by listeners in this
study: one whereby the tape is rewound and replayed at will, and another where
a sentence or clause of speech is digitized and repeatedly replayed by
computer until the listener terminates the repetition.


In summary, we may list the following characteristics of the present
experiments with the Rainbow Script:


(1) The experimental design incorporates linguistic theoretical conclusions
about stress patterns, listeners' perceptions of stress, and studies
of acoustic correlates, all in one effort.

(2) The speech is natural, not synthetic.

(3) The study is of linguistic stress (and reduction), not special
emphasis or special minimal-pair differences (as with noun-verb pairs).

(4) Three levels of stressedness are studied: stressed, unstressed, and
reduced.

(5) Sentence stress, not word stress, is being studied.

(6) A semantically-connected text (paragraph) is used, rather than
isolated sentences.

(7) The text is a well-known text, used extensively in speech studies, and
originally designed to display "habitual pitch patterns" (Fairbanks, 1940).

(8) The text incorporates one type of sentence (declarative), with a
variety of phrasal structures.

(9) Perception judgments are provided for all syllables; no forced choices
are demanded as to "most stressed syllable in utterance" or "syllable
A is more stressed than syllable B".*

(10) In the analysis of perceptual and acoustic results, sentence
tonalization (clause-final or breath-group-final) positions are analyzed
separately from neutral (non-final; intonation body) positions.

(11) Extensive sets of acoustic parameters are considered as prospective
acoustic correlates of stress.

(12) In evaluating acoustic correlates of stress, consideration is given to
both the effectiveness in correlating with (or "predicting") stressedness
or reduction and the potential for automatically extracting the
acoustic features from connected speech.

(13) The influences of phonetic environment (vowel features, voiced and
unvoiced post- and pre-vocalic consonants) are considered in evaluating
acoustic correlates.

(14) Perception tests on stress are made by several listeners, hearing the
utterance portions repeatedly, with several male and female talkers,
and with a test made of the repeatability of listener judgments.

(15) Tape rewind and replay, as well as computer storage and replay, provide
two distinct methods of speech presentation to the listener, which may
be compared.

(16) Latest work on English stress rules and vowel reduction rules is
applied to the performance problems of predicting listener perceptions
of stress and reduction, and of acoustic correlates of linguistic
predictions.

* In measurement-theory terms, the judgments here form a nominal scale, rather than
an order or ordinal scale of measurement (Stevens, 1951; Lea, 1971).


In brief, this set of experiments provides a comparison among linguistic,
perceptual, and acoustic data on total stress patterns in a well known
connected declarative discourse, spoken by several talkers. Special attention
is given to interfering effects of intonation, phonetic context, speaker and
listener differences, etc.

In section 4.2, we elaborate on advantage (16) in the above list, pointing
out how recent linguistic work will be applied to practical stress predictions.


4.2 Syntactic Analysis and Linguistic Stress Predictions

Recently published rules suggest that stress patterns within words and in
total sentences can be predicted by stress rules (and vowel reduction rules)
which require phonemic content, syntactic bracketing, and syntactic category
names as input information. To predict stress patterns using such rules, one
must first perform a syntactic analysis of the script, and specify those
phonemic aspects that are relevant to lexical stress assignment.

A complete syntactic analysis of the Rainbow Script will be done, using a
transformational grammar. Deep syntactic structure will be found for each
sentence in the script, and the surface structure will be determined by
applying an ordered set of transformations to that deep structure (based on
an extension of the grammar given in Burt, 1971). The resulting surface
structure will then be subjected to phonological readjustment rules (Chomsky
and Halle, 1968, pp. 24f) to yield the so-called "phonological representation"
of the sentences. No adequate set of phonological readjustment rules has been
written, but some expected effects are known. A decision must be made either
to hypothesize what phonological representation will result from the rules, or
to design adequate rules. The phonological representation is used to predict
expected constituent boundaries, for comparison with boundary detector results
(cf. Lea, 1972a, Appendix B).

The phonological representation will also include phonemic-string
information for the words in the structures. This phonological representation
is then to be processed through stress assignment rules and vowel reduction
rules, such as provided by Halle and Keyser (1971; based on revisions of rules
given in Chomsky and Halle, 1968; cf. also Ross, 1969, and Vanderslice and
Ladefoged, 1971). Selection of adequate stress and reduction rules is another
major task. Recent revisions in stress assignment based on putting the Nuclear
Stress Rule within the syntactic transformational cycle must also be
considered (Bresnan, 1971, 1972; Lakoff, 1972; Berman and Szamosi, 1972).
Vowel reduction rules must be given similar attention, and predictions of
reduced vowels must be obtained as well as predictions of stress levels.
Alternative rules for assigning stress and reduction can be compared with the
perceptual results, to determine which rules seem to be most satisfactory
(cf. Trager and Smith, 1951; Crystal, 1969).


At  the  time  of  this  writing,  study  of  appropriate  syntactic  and  phonological 
rules  has  just  begun,  and  linguistic  predictions  are  thus  yet  to  be  obtained. 

4.3  Perception  Tests 

The Rainbow Script has been read by six talkers, providing natural speech for
both perception and acoustic analyses.* Four listeners have been presented with
the speech, and asked to record, for each syllable, their individual judgment as
to whether the syllable was spoken as stressed, unstressed, or reduced. One
listener repeated the experiment several times to establish listener consistency
and effects of some other experimental variables. Here we discuss in some
detail the preliminary results of these perceptual tests.

* We are indebted to George W. Hughes and Kung-Pu Li at Purdue University for
providing the speech recordings and some of the perceptual data for the Rainbow
Script. The perceptual testing procedures used here are based on a modification
of the procedures used by Hughes, Li, and Snow (1972).

The basic method of the perceptual study consists of presenting, to an
individual listener through earphones, the recordings of each of the six talkers.
The listener is asked to mark (in whatever way he chooses), for each syllable,
whether he hears that syllable as stressed, unstressed, or reduced. To
facilitate marking for each syllable, the script is typed on a sheet of paper
with vertical slashes between syllables. A mark must be provided for each
syllable (between two slash marks). The listener receives one such sheet for
each talker. Results were obtained for listeners (ASH, GWH, WAL, and TBS) who
are trained in the speech field, and thus in some sense familiar with notions
of stress and vowel reduction.

The script was spliced into clause or sentence portions separated by long
(about 3-second) pauses. This facilitated stopping the recording after each
clause, recording certain judgments, then backing up the recording again to the
beginning of that portion of the script, to hear it again. The listener could
listen to the tape portions as often as necessary to mark each syllable. He
was free to back up the tape at his choice, and no time limit or procedural
constraints were placed on him. At least one listener (WAL) found that
approximately once for each syllable a tape rewind and listening was required
to firmly establish the categorizations.

Listeners reported that some syllables were clearly stressed and some clearly
reduced, while many were not so readily categorized. In an initial experiment
where listeners had been asked to mark each stressed syllable and each reduced
syllable, those not marked were, by default, considered as unreduced, unstressed.
This bias toward the extremes of high stress and lowest stress (reduction)
probably carried over into the final experiment. Two binary decisions thus may
appear to be involved in the judgments: "Is the syllable stressed?" and "Is the
syllable reduced?" A "no" answer to each yields the middle ground of "unstressed"
syllables.

Figure 6 illustrates the results of the initial perception test for one talker
(ASH). Perceptual results from a fifth listener (RPS) were included along with
those of the four listeners previously mentioned. Plotted for each of the
syllables in the Rainbow Script are the number of stressed judgments minus the
number of reduced judgments, for the five listeners. Unstressed judgments were
assigned values of zero. (No cases occurred where one listener's reduced
judgment cancelled another's stressed judgment.) Thus, if all five listeners
heard a syllable as stressed, a value of 5 was plotted; if two perceived a
syllable as reduced, and the other three perceived it as unstressed, a value of
-2 (minus two) resulted. The syllables which were most definitely stressed
(i.e., perceived by all listeners as stressed) thus were at the top of the
scale; those definitely perceived reduced were at the bottom of the scale. One
listener (ASH) unfortunately provided no judgments about occurrences of reduced
syllables. Thus, the most negative values shown are -4, indicating unanimous
agreement among the four listeners judging reduced syllables.
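
The scoring that underlies Figure 6 can be sketched in a few lines. The
following is a minimal illustration only, assuming that each listener's marks
are available as one label per syllable; the function name stress_score, the
label codes, and the sample marks are hypothetical and are not data from the
experiment.

```python
from collections import Counter

def stress_score(judgments):
    """Score one syllable: stressed count minus reduced count.

    `judgments` holds one label per listener: "S" (stressed),
    "U" (unstressed, worth zero), or "R" (reduced).
    """
    counts = Counter(judgments)
    return counts["S"] - counts["R"]

# Hypothetical marks from five listeners for three syllables.
syllables = [
    ["S", "S", "S", "S", "S"],  # unanimously stressed
    ["R", "R", "U", "U", "U"],  # two reduced, three unstressed
    ["U", "R", "R", "R", "R"],  # four reduced; one listener gave no reduced marks
]
scores = [stress_score(marks) for marks in syllables]
print(scores)
```

The three sample syllables reproduce the cases described in the text: a
unanimous stressed judgment, two reduced judgments among five, and the most
negative value possible when only four listeners mark reduction.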

Figure  6  thus  shows  which  syllables  are  judged  by  a  set  of  listeners  to  be 
more  or  less  definitely  stressed,  unstressed,  or  reduced.  While  these  results 
are  for  the  initial  test  where  every  syllable  did  not  have  to  be  marked  (so  that 
unmarked  syllables  were,  by  default,  considered  unstressed),  similar  results  are 
to  be  obtained  for  the  more  controlled  tests  where  each  syllable  is  categorized. 
From  such  results,  one  can  readily  see  which  syllables  are  unanimously  judged 
as  stressed,  which  are  judged  as  stressed  by  a  majority  of  the  listeners,  etc. 

When syllables such as long, round, arch in the second sentence shown in Figure
6 are unanimously judged as stressed, one can be more confident that acoustic
cues to stress are to be found.

The results of one listener (WAL) marking his perceptions for the same
talker (ASH), under several different conditions, are shown in Figure 7. As in
Figure 6, the judgments are "plotted" for each syllable. The little boxes
connected by dashed lines show whether the listener judged the syllable as
stressed, unstressed, or reduced, when the listener was required to mark all
syllables, while repetitively rewinding and replaying the taped speech. Also
shown in Figure 7 are X's marked at the stress level judged in the earlier test
where no mark was provided for unstressed or missed syllables. These X's are
included only where the earlier results differed from the judgments when every
syllable had to be marked. The preponderance of X's at the "unstressed" level
shows that most of the differences from the later, complete test were due to
missing marks on what may well have been perceptually stressed or reduced
syllables. This supports the need for requiring expressed judgments on all
syllables.

Figure 7 also shows results when the same listener used the Univac
interactive speech research facility to digitize and store the clausal portion
from the tape, and play it back repeatedly through the D/A converter. This
eliminated the cumbersome stopping, rewinding, listening, stopping, rewinding,
etc. of the tape recorder system. (It did, however, introduce some background
noise due to A/D and D/A processes.) The ready repetition with the digital
system allowed a finer judgment of stress "levels", the listener believing he
could then separate stressed syllables into two categories: highly stressed and
less stressed. These four categories are shown in Figure 7, and the judgments
for all syllables are connected by the solid lines. Comparing all these results
syllable-by-syllable, it is evident that the two tests give similar results,
with almost all "highly stressed" syllables and most "lesser stressed" syllables
from the computer-aided test corresponding to stressed syllables in the
tape-replay test. The four-level judgments might be said to break up a continuum
(from highest-stressed to reduced) into four categories, by a different setting
of 'thresholds' than the three-level judgment involves.

Figure 7 is drawn to such a scale that it could be laid directly over
Figure 6, to show a close correspondence between the results of five listeners
reporting on one talker and the results of one listener reporting on the same
talker under several conditions. The close correspondence between the
syllable-by-syllable judgments plotted in Figures 6 and 7 shows that the one
listener is in one sense "representative" of the group of listeners.

Figure 8 summarizes the perceptions of listener WAL when marking all
syllables in the scripts read by each of the six talkers. Results with the
tape-rewind approach are compared with those for the four-level results of the
computer-replay approach. As evidenced by the syllable-by-syllable 'plot' of
Figure 7 (for one talker), those syllables judged as highly stressed in the
computer-replay run (top rows in all matrices of Figure 8) were usually judged
as stressed in the tape-replay run (leftmost column). Those judged as stressed,
but at a lesser level, (second row) in the computer-replay run were usually
judged as stressed in the tape-rewind run. Most that were judged as reduced in
one run were judged as reduced in the other. Thus, results for each of the
other five talkers were similar to those reported in detail in Figure 7 for
single talker ASH. In this sense, talker ASH is representative of the other
talkers.

(b) Talker CWH

Figure 8. Comparison of three-level and four-level stress judgments of one
listener (WAL) marking all syllables in the Rainbow Script spoken by each of six
talkers. The three-level test was run with the tape rewind method, while the
four-level test involved computer storage and replay. Tabulated in each matrix
position is the number of syllables categorized as shown by the headings
on the respective row and column.
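
A comparison matrix of the kind tabulated in Figure 8 is a cross-tabulation of
two label sequences over the same syllables. The sketch below shows one way
such a matrix might be built. It is illustrative only: the label codes (HS, LS,
U, R for the four-level run; S, U, R for the three-level run), the function
name cross_tabulate, and the sample labels are assumptions and do not
reproduce any entries of Figure 8.

```python
from collections import Counter

ROWS = ["HS", "LS", "U", "R"]   # four-level labels (computer-replay run)
COLS = ["S", "U", "R"]          # three-level labels (tape-rewind run)

def cross_tabulate(four_level, three_level):
    """Count syllables for each (four-level, three-level) label pair."""
    if len(four_level) != len(three_level):
        raise ValueError("both runs must label the same syllables")
    pairs = Counter(zip(four_level, three_level))
    return [[pairs[(r, c)] for c in COLS] for r in ROWS]

# Invented labels for eight syllables, for illustration only.
computer_run = ["HS", "LS", "U", "R", "HS", "R", "LS", "U"]
tape_run     = ["S",  "S",  "U", "R", "S",  "U", "U",  "R"]

matrix = cross_tabulate(computer_run, tape_run)
for label, row in zip(ROWS, matrix):
    print(f"{label:>2}: {row}")
```

Agreement between the two runs shows up as large counts where the row and
column labels correspond (HS or LS against S, U against U, R against R), and
disagreement as counts elsewhere in the matrix.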

Figure 8 also shows that, of the 127 syllables in the Rainbow Script,
about 50 to 60 (about 40 to 50%) were judged as stressed (or, alternatively,
"highly stressed" or "lesser stressed"), slightly fewer (about 35 to 40%) were
judged as reduced, and only about 14 to 30 (10 to 25%) were judged as unstressed.
Thus, if one were to analyze only the stressed syllables, as suggested in section
2.4, the distinctive-features analysis could be avoided in the 50 to 60% of
unstressed and reduced syllables, where distinctive features analysis is most
difficult and unreliable. Figure 8, and Figures 6 and 7 as well, also
illustrate that more confusions or inconsistencies occur between unstressed and
reduced categories than between stressed and unstressed syllables (cf.
Lehto, 1969). Thus, a procedure for reliably distinguishing stressed from
unstressed syllables might be more successful than one for distinguishing
unstressed from reduced syllables.

Figure 9 shows perception comparison matrices for directly comparing listener
WAL's judgments for talker ASH with those for the other five talkers. The entries
off the main diagonal of each matrix show the deviations from identical
perceptions for talker ASH and the other talker. Since a large majority of the
syllables were either perceived as stressed for both talkers, unstressed for
both, or reduced for both, talker ASH is representative of the other talkers. A
study of the perceptions for each of the individual syllables (as Figure 7
provided for one talker and one listener) will indicate which syllables are
most likely to be pronounced differently by different talkers, and which
syllables have the most stable stress assignment.

Perception tests by the three other listeners are now being conducted,
with each syllable to be marked as stressed, unstressed, or reduced. Data such
as that illustrated in Figures 6 to 9 will be obtained for these additional
listeners.

Figure 9. Comparisons of stress level perceptions by listener WAL for talker ASH
versus the other five talkers. Tabulated are the number of syllables judged as
shown by the column heading, for ASH, and judged as shown by the row heading for
the other talker.

The perceptual results are to be compared with the linguistically predicted
stress levels to see what abstract levels are judged as stressed, unstressed,
or reduced. For example, are stress levels 1 and 2 both perceived as
"stressed", or only level 1, or what? The perceptions must also be compared with
the acoustic features, to be discussed next.


4.4 Acoustic Analysis

The spoken scripts were processed through the linear predictor, formant
tracker, prosodic processors and boundary detector, described in section 3, to
obtain data on the F0 contours, intensity contours, formants, and other
acoustic features that may correlate with stress, reduction, and syntactic
boundaries. A variety of F0-contour parameters and intensity-contour parameters
are to be analyzed as potential correlates of stress. In this study, only F0
contours are used in detecting constituent boundaries, in accord with the
constituent boundary detector design described in section 3.4.


4.4.1 F0 Correlates of Stress

The F0 parameters to be considered as possible acoustic cues to stress
include: the peak F0 in the vowel*; the mean F0 in the vowel; the F0 rise or
fall in the initial portion of the vowel; the F0 contour shape throughout the
vowel (Medress, Skinner, and Anderson, 1971); and F0 values in voiced consonants.
Also to be considered are the coefficients of polynomial representations of F0
contours (Levitt and Rabiner, 1971).


Such F0 parameters are not easy to extract automatically from F0
contours. F0 peaks may occur in transitions between non-vowel sonorants and
vowels, rather than within the vowel, and in such cases may or may not closely
correspond with stress values. Also, F0 peak values will depend upon the
identity (specifically, the high/low feature) of the vowel, with low vowels
showing lower F0 peaks for the same laryngeal tensions and subglottal pressure.
Average values are similarly affected, but they are even more difficult to
automatically extract, since the durations over which the average must be
computed must be determined, and an averaging computation is required. Finding
the F0 rise or fall in the initial portion of the vowel demands finding the
vowel onset, which is particularly difficult following sonorants, but much
easier following unvoiced consonants. F0 contours throughout the vowel also
require establishing the endpoints of the vowel. F0 values in consonants
require establishing the locations of consonants.

* Where "vowel" is used here and throughout this section, the syllable nucleus
will also be considered (Lehisto, 1972; Stevens and Klatt, 1968; Stevens, 1969).

Since these contour parameters are highly dependent upon locating vowels
and consonants and their boundaries, they may not lend themselves to easy
automatic extraction, even if when manually extracted they correlate well with
stress patterns. However, if some other features, such as intensity contours or
formant structure, can locate vowels or consonants and their approximate
boundaries, then these F0 contour parameters may be mechanically extracted.
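
To make the dependence on vowel boundaries concrete, the sketch below computes
several of the F0 parameters listed above once a vowel onset and offset are
assumed to be known. It is an illustration under stated assumptions, not the
procedure of the prosodic processors of section 3: the frame-based contour, the
function name f0_parameters, the three-frame "initial portion", and the sample
values are all hypothetical, and a straight-line least-squares fit stands in
for the higher-order polynomial representations cited from Levitt and Rabiner.

```python
def f0_parameters(f0, onset, offset, initial_frames=3):
    """Extract simple F0 cues from frames onset..offset-1 of a contour.

    `f0` is a list of F0 values in Hz, one per analysis frame. The vowel
    boundaries `onset` and `offset` are assumed to be known already.
    """
    vowel = f0[onset:offset]
    if len(vowel) < 2:
        raise ValueError("vowel must span at least two frames")
    n = len(vowel)
    mean = sum(vowel) / n
    # Rise (positive) or fall (negative) over the initial portion of the vowel.
    last = min(initial_frames, n - 1)
    initial_change = vowel[last] - vowel[0]
    # Least-squares slope in Hz per frame: a first-order polynomial fit.
    t_mean = (n - 1) / 2
    num = sum((t - t_mean) * (v - mean) for t, v in enumerate(vowel))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return {
        "peak": max(vowel),
        "mean": mean,
        "initial_change": initial_change,
        "slope": num / den,
    }

# Invented contour (Hz); frames 2..6 are taken to be the vowel.
contour = [0, 0, 110, 120, 130, 125, 115, 0, 0]
params = f0_parameters(contour, onset=2, offset=7)
print(params)
```

Every value returned depends on the onset and offset arguments, which is the
practical difficulty noted above: an error in locating the vowel changes the
mean, the initial rise, and the fitted slope, and may change the peak as well.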


4.4.2 Intensity Correlates of Stress

Intensity parameters to consider include: the maximum intensity in the
vowel; the average intensity in the vowel; the intensity rise in the initial
portion of the vowel; the integral of the energy within the vowel; the overall
intensity contour shape in the vowel; energy within prevocalic and postvocalic
consonants; and the presence of aspiration after stop releases. Full broadband
energy will include aspirations, high-frequency frication noise, and stop bursts.
Low-pass filtered energy may be used, however, to detect phonation or voicing
energy, which would be high in vowels, presumably smaller in nonvocalic
sonorants, low in voiced obstruents, and near zero in unvoiced consonants. This
may provide some cues to the presence and locations of various categories of
consonants and vowels.

Maximum intensity and average intensity in a vowel may correlate well with
stress in neutral intonation positions, but the energy integral within a vowel,
according to earlier studies (Lehto, 1971; Medress, Skinner, and Anderson,
1971), would be even better. However, the energy integral requires determining
the boundaries of the vowel, which is not always readily accomplished
mechanically.

High-frequency energy within obstruents may be higher when they are in
stressed syllables, so that the difference between total energy and phonation
energy might be considered as a potential cue to stress.
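
The intensity parameters discussed in this section might be computed along the
following lines, given per-frame energy tracks and assumed segment boundaries.
This is a sketch only: the function name intensity_parameters, the 10-msec
frame spacing, the linear energy units, and the sample values are assumptions
made for illustration.

```python
def intensity_parameters(broadband, lowpass, onset, offset, frame_s=0.01):
    """Intensity cues over frames onset..offset-1.

    `broadband` and `lowpass` are per-frame energies (arbitrary linear
    units) for the full band and the low-pass (phonation) band.
    `frame_s` is the assumed frame spacing in seconds.
    """
    seg = broadband[onset:offset]
    low = lowpass[onset:offset]
    if not seg or len(seg) != len(low):
        raise ValueError("segment is empty or the two tracks differ in length")
    integral = sum(seg) * frame_s          # rectangle-rule energy integral
    return {
        "max": max(seg),
        "average": sum(seg) / len(seg),
        "integral": integral,
        # Total minus phonation energy, summed over the segment.
        "high_freq": sum(b - l for b, l in zip(seg, low)),
    }

# Invented energy tracks; frames 2..5 are taken to be the segment of interest.
broadband = [1, 2, 40, 60, 50, 30, 3, 1]
lowpass   = [0, 1, 38, 57, 48, 29, 1, 0]
cues = intensity_parameters(broadband, lowpass, onset=2, offset=6)
print(cues)
```

The maximum can be found without exact boundaries as long as the peak lies
inside the segment, whereas the average and the integral change with every
shift of onset or offset, in keeping with the remark above about the energy
integral.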


4.4.3 Phonetic Durations as Stress Correlates

Duration  and  timing  parameters  that  will  be  considered  as  potential  stress 
cues  include:  duration  of  the  portion  of  a  vowel  which  has  rising  intensity 
(from  previous  consonant  to  energy  peak  in  the  vowel  nucleus;  cf.  Lehto,  1969); 
total  vowel  duration;  durations  of  prevocalic  and  postvocalic  consonants;  time 
intervals  between  vowel  centers  (peaks);  and  total  syllabic  durations. 

To mark phonetic boundaries (and thus establish phonetic durations),
significant changes in some acoustic features would presumably be sought. If
the F0 contours and intensity contours do not provide sudden changes
appropriate for such boundary marking, then other acoustic features would have
to be considered. Spectral structure (e.g. presence and positions of formant
structures) might provide some cues. Ultimately, however, we recognize that
speech is not a sequence of discrete acoustic units corresponding to individual
phonemes, and such phonetic durations can, at best, be approximate indications
of the quantity of the wave most closely associated with each phone. While
subjectively determined durations (based on personal judgments as to where
significant acoustic changes occur) may be found to correlate well with stress
levels, the ease with which they may be mechanically determined will play a
major role in determining their utility for speech recognition systems.

4.4.4  Vowel  Quality  and  Reduction 

Vowel  quality  is  known  to  tend  more  toward  schwa-like  sounds  for  many  weakly 
stressed  and  reduced  syllables.  Reduction  often  causes  diphthongs  to  degenerate 
to  single  pure  vowels  (Fry,  1953),  and  causes  general  centralisation  of  a  vowel. 
Since unstressed and reduced vowels tend to be quite short, the vowel 'target'
positions of articulation are not attained and thus formant target values are
not reached (Lindblom, 1963). These effects may provide spectral features that
can be correlated with stress levels.

4.4.5 Initial Examples

An earlier study of acoustic correlates of stress in isolated words (Medress,
Skinner, and Anderson, 1971) showed that vowel duration was longer in (98% of all)
stressed syllables than in unstressed syllables, when the stressed syllable was
utterance-final. However, when the stressed syllable was in earlier portions of
the utterance, durations were not as reliably correlated with stress (with the
stressed vowel duration being longest in 67% of all cases where it is in a
medial syllable, and only 51% of the time in stressed initial syllables).

On the other hand, peak fundamental frequency was most reliably associated with
stress in the earliest syllables (forming the intonation head for the utterance,
in common intonation-contour terms; cf. Lehto, 1969). Peak F0 was highest in
the stressed syllable 93% of the time for initial syllables, 60% of the time
for medial syllables, and 40% of the time for utterance-final syllables.
Similarly, the average energy in the vowel was highest in the stressed syllables
in 89%, 63%, and 60% of the initial, medial, and final syllables, respectively.
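
The percentages just quoted can be gathered into a small table and queried for
the cue that was most reliable at each position. The sketch below does only
that; the percentages are those reported above for isolated words, while the
function name best_cue and the layout of the table are illustrative choices.

```python
# Percentage of cases in which each cue was greatest in the stressed syllable,
# by position of that syllable in the word (figures quoted in the text above).
RELIABILITY = {
    "vowel duration": {"initial": 51, "medial": 67, "final": 98},
    "peak F0":        {"initial": 93, "medial": 60, "final": 40},
    "average energy": {"initial": 89, "medial": 63, "final": 60},
}

def best_cue(position):
    """Return the (cue, percent) pair with the highest percentage."""
    return max(
        ((cue, by_pos[position]) for cue, by_pos in RELIABILITY.items()),
        key=lambda pair: pair[1],
    )

for position in ("initial", "medial", "final"):
    cue, percent = best_cue(position)
    print(f"{position}: {cue} ({percent}%)")
```

On these figures, peak F0 leads for initial syllables and vowel duration for
final syllables, while in medial position no cue exceeds 67%.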

Thus these studies with isolated words showed effects that the position in
the utterance had on the reliability of acoustic correlates of stress. Such
effects may also be expected in connected speech. For example, in section 4.1,
paragraph numbered (10), illustrations were given of such effects of position in
the clause "they act like a prism" from the Rainbow Script (see also Figure 3).

4.5 Interpreting the Data

The prosodic data (as in Figure 3), and the digital spectrograms and formant
tracks, all obtained for all six talkers reading the Rainbow Script, will provide
extensive acoustic data to relate to the perceptual and linguistic data about
stress patterns. The F0 data processed through the constituent boundary detector
will also yield boundary data to be compared with expected constituent boundaries.


The percentage of all expected syntactic boundaries that are correctly
detected will be determined, as will the number of "false alarms" (where
boundaries were not expected).

A syllable-by-syllable comparison of the acoustic correlates with
perceived stress levels (probably based on the majority or unanimous judgments of
all the listeners) will be conducted. The acoustic features and perceptual data
will also be compared with the linguistically-predicted stress patterns.
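
The boundary scoring mentioned above (the percentage of expected boundaries
that are detected, and the number of false alarms) might be tallied along the
following lines. This is a sketch only: the matching tolerance, the function
name score_boundaries, and the sample boundary times are assumptions made for
illustration, and are not part of the boundary detector described in section
3.4.

```python
def score_boundaries(expected, detected, tolerance=0.05):
    """Match detected boundary times (s) to expected ones.

    A detection counts as a hit if it lies within `tolerance` seconds of an
    expected boundary that has not already been matched (the earliest such
    boundary is taken); every unmatched detection is a false alarm.
    """
    unmatched = sorted(expected)
    hits = 0
    false_alarms = 0
    for time in sorted(detected):
        match = next((e for e in unmatched if abs(e - time) <= tolerance), None)
        if match is None:
            false_alarms += 1
        else:
            unmatched.remove(match)
            hits += 1
    percent = 100.0 * hits / len(expected) if expected else 0.0
    return percent, false_alarms

expected = [0.50, 1.20, 2.00, 2.75]   # invented expected boundaries
detected = [0.52, 1.60, 2.03, 2.74]   # invented detector output
percent, false_alarms = score_boundaries(expected, detected)
print(percent, false_alarms)
```

In this invented case three of the four expected boundaries are matched and
one detection falls where no boundary was expected. The choice of tolerance
directly affects both figures, so it would need to be stated along with any
results.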


When listeners agree as to the stress level of syllables spoken by a talker,
these results may well be taken as the standard to compare with linguistic
predictions and acoustic data. Where acoustic correlates or perceptions differ
radically from linguistic predictions, the fault may be more in the gap between
speech performance and linguistic models of competence, than in difficulties of
perceptual or acoustic analysis.

Ultimately, the results of correlating acoustic, perceptual, and linguistic
data may be difficult to precisely interpret because of several interfering
factors, such as syntactic phrase structure, positions in sentence intonation
contours, phonetic sequence effects, etc. Studies with the Rainbow Script will
presumably indicate instances where such factors can or cannot be readily isolated.
To isolate such factors more completely, sentences and connected texts will be
specifically designed, as part of the further studies outlined in section 5.


5. FURTHER STUDIES


5.1  Reviewing  Speech  Texts  for  the  ARPA  Data  Base 

The Univac DSD Speech Communications Group is currently involved in a program
to select good speech texts for use by the systems contractors of the Speech
Understanding Research Program. In cooperation with Professor Michael O'Malley
of  the  University  of  Michigan  and  Dr.  June  Shoup  of  Speech  Communications  Research 
Laboratory  (SCRL),  a  two-phase  effort  is  being  undertaken.  In  one  phase,  sentences 
of  general  interest  to  the  five  systems  contractors  will  be  selected  from  a  larger 
set  which  the  system  builders  select  as  representative  of  the  type  data  they  hope 
to  handle.  In  another  phase  of  the  data-base  design,  texts  will  be  very  carefully 
designed  to  isolate  problems  that  are  expected  to  be  encountered  in  extendable 
speech  recognition  systems. 

To accomplish phase one of the program, each system builder will select
about fifty sentences of the type they expect their system to handle, giving
consideration to problems they anticipate encountering in their system. These
will be recorded and provided, along with an orthographic transcription and a
list of criteria used to select the texts, to SCRL, the University of Michigan,
and Univac, for their review.

The Univac Speech Communications Group will give particular attention to
evaluating characteristics of the data that relate to prosodic patterns (such
as sentence intonation contour, position in discourse, and stress patterns) and
higher-level linguistic considerations (number of words in the vocabulary,
variety of words in each parts-of-speech category, representative variety of
syntactic structures, sentence types, semantic relationships, etc.). Other
factors being considered (primarily by the other two groups) include phonemic
variety and balance, allophonic variations in the text, morphophonemic
phenomena (such as coarticulation effects, word or morpheme boundaries, etc.),
dialect differences, and style (read speech, spontaneous speech, etc.).

Following the separate evaluations of some of these characteristics by each
of the three groups, a common workshop will be held where the three groups, in
cooperation with James Forgie of Lincoln Laboratory, will select, from the 250
sentences, ten good sentences to be recorded later by two talkers from each
systems contractor, plus 50 to 100 other sentences, either selected from the
remainder of the acceptable portion of the original 250 sentences, or
specifically designed to fill any gaps in the data. These selected utterances
will form an initial part of the Lincoln Speech Data Facility.

Later, the three groups and systems contractors will meet to devise specific
ways in which prosodic features and phonetic patterns in the selected data may be
used in each of the five systems. Univac plans to process some of the data through
the constituent boundary detection program, and to estimate (or actually obtain)
perceptual and acoustic data about stress patterns in representative utterances.
Such indications of actual or expected prosodic patterns will be available for
system builders and others to use in devising prosodic aids to speech
understanding systems.


5.2 Designing Sentences for Isolating Prosodic Effects

Phase  two  of  the  data  base  design  program  is  concerned  with  designing  a  set 
of  sentences  or  texts  which  isolate  certain  factors  which  may  affect  the  success 
of  speech  understanding  systems. 

One way to determine what is causing any particular pattern in speech data
(such as the presence or absence of an F0 valley at a constituent boundary, or
the occurrence of a long or short vowel duration in a stressed syllable in
utterance-final position) is to compare utterances which are identical except for
one or a few differences. This is how phonemes of English are sometimes
determined, using minimal pairs (such as pit/pet for distinguishing /i/ and /e/).
Likewise, acoustic correlates of stress have been studied based on contrasting
words like pérmit (noun) versus permít (verb), in which only the stress patterns
differ in the two words. These techniques may be applied to determining effects of
different syntactic structures, sentence types, intonation contours, phonemic structure,
etc.

The Univac Speech Communications Group is designing a set of utterances which
can isolate the specific effects (particularly on prosodic patterns) of some of the
following factors: spontaneous speech versus spoken texts versus speech actually
used in man-machine interaction; positions of sentences or other speech portions
within paragraphs or discourse; the type of sentence (declarative, yes-no question,
WH-question, or command); positions of phrases within an overall sentence
intonation contour (intonation body or tonalization positions); simplex versus complex
sentences; special sentence transformations (e.g., adverb preposing or passives);
special phrase category effects (e.g., unstressed nature of pronouns); vocabularies
and subvocabularies; phonetic variety and balance (do both voiced and unvoiced
consonants follow a vowel, etc.); and effects of error in pronunciation and analysis
(which analysis errors may give the most lexical errors in bottom-up analyzers,
etc.).


With so many dimensions to be independently controlled and tested, the
difficult problem is how to keep the set of designed utterances to a reasonable size.
Among the techniques being used to restrict the data set is the obvious
procedure of incorporating within a single sentence several contrasts that are still
distinguishable and isolated (by their distance apart in the sentence, for
example). Another procedure is to first study those contrasts which are most likely
to affect performance of speech understanding systems, leaving subtleties for
later study.
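
To make the size problem concrete, the following sketch counts the utterances
needed by a full factorial design and by a design that changes one factor at a
time from a baseline (in the spirit of minimal pairs). The factor names and the
number of levels are illustrative assumptions loosely drawn from the list of
factors above; they are not the actual Univac design.

```python
from itertools import product

# Hypothetical factors and levels, assumed for illustration only.
FACTORS = {
    "speech_style": ["spontaneous", "read", "man-machine"],
    "sentence_type": ["declarative", "yes-no question", "WH-question", "command"],
    "complexity": ["simplex", "complex"],
    "transformation": ["none", "adverb preposing", "passive"],
    "phrase_position": ["initial", "medial", "final"],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def one_factor_at_a_time(factors):
    """Return a baseline design plus one variant per non-baseline level."""
    baseline = {name: levels[0] for name, levels in factors.items()}
    designs = [baseline]
    for name, levels in factors.items():
        for level in levels[1:]:
            # Each variant differs from the baseline in exactly one factor.
            designs.append({**baseline, name: level})
    return designs


full = full_factorial(FACTORS)
minimal = one_factor_at_a_time(FACTORS)
print(len(full), len(minimal))  # prints: 216 11
```

Even with only five assumed factors the full design needs 216 utterances, while
the one-difference-at-a-time design needs 11, at the cost of not observing
interactions between factors.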

5.3 Guidelines to Use of Prosodics in Speech Understanding Systems

As part of a continuing effort in coordinating the Univac studies with
activities of other ARPA contractors, Wayne Lea is preparing a tutorial for presentation
at a forthcoming meeting of ARPA contractors. This tutorial deals with aspects of
prosodics, syntactic structure, and semantics which are of interest to systems
builders. It is being coordinated with other presentations to be given by Dennis
Klatt of Massachusetts Institute of Technology, Mike O'Malley of the University
of Michigan, and June Shoup of Speech Communications Research Laboratory. These
tutorials will collectively summarize many aspects of acoustic processing,
acoustic phonetics, phonemics, morphophonemics, prosodics, syntax, semantics, and
pragmatics.

Plans are also being made for using prosodic features to aid distinctive
features estimation routines, syntactic parsers, and semantic processors, using
the ARPANET for access to programs at facilities of other ARPA contractors.

These efforts with prosodic aids to speech understanding systems
are part of a general strategy to use the most reliable information in the
acoustic data in conjunction with early use of linguistic regularities. Prosodic
features are expected to play a crucial role in such a strategy for sentence
recognition. Their effectiveness will depend upon how well they are
incorporated into total systems for speech understanding.


6. CONCLUSIONS


Research on prosodic aids to speech recognition is still in progress.
Conclusions at this point thus necessarily are confined to a synopsis of the general
strategy and motivation of the work, and a few preliminary judgments from the
limited stress perception data and boundary detection results.

Linguistic and perceptual arguments clearly suggest the value of early use of
syntactic hypotheses in recognition routines. Prosodic features can provide cues
to the presence of syntactic constituent boundaries and to the stress pattern
of the spoken utterance. In particular, fall-rise patterns in voice fundamental
frequency contours mark major constituent boundaries. Sentence boundaries are
usually marked by pauses followed by large increases in fundamental frequency in
the beginnings of new sentences. While prosodic features of fundamental frequency,
intensity, and phonetic durations are known to be associated with English stress
levels in isolated utterances, the best acoustic correlates to use in
automatically determining stress levels in connected speech must still be found.
Techniques for using prosodic features in aiding distinctive features estimation,
syntactic parsing, and semantic representation are yet to be implemented and tested.
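
The fall-rise cue can be illustrated with a minimal sketch. It is not the
Univac boundary detection program: the function name, the one-value-per-frame
contour format (0 for unvoiced frames), and the 7% relative threshold are all
assumptions made for illustration. The sketch reports the frame of each
fundamental frequency valley that is preceded by a sufficient fall and followed
by a sufficient rise.

```python
def find_fall_rise_boundaries(f0, min_change=0.07):
    """Return frame indices of F0 valleys that lie between a fall and a rise.

    f0: fundamental frequency in Hz, one value per frame; 0 marks unvoiced frames.
    min_change: minimum relative fall and rise (an assumed threshold).
    """
    voiced = [(i, hz) for i, hz in enumerate(f0) if hz > 0]
    boundaries = []
    if not voiced:
        return boundaries
    peak = voiced[0][1]               # highest F0 since the last reset
    valley_index, valley = voiced[0]  # lowest F0 since that peak
    for i, hz in voiced[1:]:
        if hz < valley:
            valley_index, valley = i, hz
        fell = (peak - valley) / peak >= min_change
        rose = (hz - valley) / valley >= min_change
        if fell and rose:
            # A substantial fall followed by a substantial rise: mark the valley.
            boundaries.append(valley_index)
            peak, valley_index, valley = hz, i, hz
        elif hz > peak:
            # Still climbing without a substantial fall: move the peak forward.
            peak, valley_index, valley = hz, i, hz
    return boundaries


# Hypothetical contour (not data from the report).
contour = [120, 125, 130, 122, 110, 100, 0, 0, 105, 118,
           128, 126, 120, 112, 104, 98, 96, 100, 110]
print(find_fall_rise_boundaries(contour))  # prints: [5, 16]
```

A practical detector would also need smoothing and some handling of the local
F0 perturbations at consonant-vowel transitions, which this sketch ignores.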

At Univac DSD, the basic strategy for acoustic speech analysis is to locate
the reliable, clearly-encoded prosodic and distinctive features in the acoustic
data, and incorporate them immediately with linguistic regularities to provide the
data for generating syntactic hypotheses, lexical identifications, and semantic
judgments. One general procedure being considered is to locate boundaries
between major grammatical constituents, find the stressed syllable(s) in each
constituent, and do a partial distinctive features analysis within the stressed
syllables, where distinctive features are expected to be most clearly and
consistently manifested.

The program for detecting constituent boundaries from fundamental frequency
contours, as implemented at Univac, appears to give satisfactory results, although
it must still be tested with further texts, including questions and commands, and
it may profit from refinements to eliminate false alarms at some consonant-vowel
boundaries and to incorporate energy cues within the algorithm.


Basic analysis tools needed for the partial distinctive features estimation
have been implemented. These include linear predictor analysis and formant
tracking. Preliminary attempts at restricted distinctive features estimation
have been incorporated into previous large-vocabulary word-recognition or
phrase-recognition schemes (Medress, 1972), but much more is to be done to devise
adequate distinctive features estimation schemes suitable for use in the stressed
syllables of connected speech.
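
Linear predictor analysis models each speech sample as a weighted sum of the
samples just before it. The sketch below uses the autocorrelation method with
the Levinson-Durbin recursion, a standard formulation; the report does not say
which formulation was implemented at Univac, so this choice and all function
names are assumptions.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a so that x[t] ~ sum(a[k] * x[t-1-k]).

    Returns (a, error), where error is the residual prediction energy.
    """
    a = [0.0] * order
    error = r[0]
    for i in range(order):
        if error <= 0:
            break  # silent or perfectly predicted frame; stop early
        # Reflection coefficient for predictor order i + 1.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(i):
            updated[j] = a[j] - k * a[i - 1 - j]
        a = updated
        error *= 1.0 - k * k
    return a, error


# Synthetic test signal: each sample is 0.9 times the previous one.
frame = [0.9 ** t for t in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), order=2)
print(coeffs)
```

For this synthetic signal the first coefficient comes out very close to 0.9 and
the second very close to zero. Formant tracking would go one step further and
locate the resonances of the resulting all-pole filter, which is not shown here.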

To find the stressed syllables wherein the major distinctive features
estimation effort will be concentrated, more must be learned about stressed syllables
in connected speech. The experiments with the Rainbow Script are designed to
interrelate linguistic predictions, perceptual judgments, and acoustic data about
stressed, unstressed, and reduced syllables in connected speech. A syntactic
analysis and subsequent application of published stress rules will yield testable
predictions about normative stress patterns for the script. These will be compared
with perceptual judgments and with acoustic features, including several
parameters of fundamental frequency contours, energy contour parameters, durations,
and vowel quality. However, considering the frequently discussed gap between
ideal linguistic competence and actual listener or talker performance, the
theoretical predictions are not to be taken as the standard for determining acoustic
correlates of stress. The performance of listeners who judge stress level based
on what they hear in the speech signal may be argued to be a better standard for
assessing acoustic correlates.

As  of  October,  1972,  the  acoustic  data  (contours  of  fundamental  frequency 
and  energy,  digital  spectrograms,  and  formant  tracks)  had  been  obtained  for  six 
talkers  reading  the  Rainbow  Script.  The  syntactic  analysis  had  not  been  done,  but 
some  of  the  perceptual  data  had  been  gathered. 

The  partial  results  from  some  listeners'  judgments  as  to  whether  syllables 
are  stressed,  unstressed,  or  reduced  suggest  several  preliminary  conclusions. 

When listeners could hear clausal portions of the text repeated at will (either by
tape rewind and replay, or by having the speech digitized, stored, and repetitively
retrieved and D/A converted), they were able to distinguish individual stressed,
unstressed, and reduced syllables. The close correspondence between the average
performance of five listeners and the performance of one listener (WAL) showed
that listener to be representative of the set of listeners.

Listener WAL performed the test for all six talkers under three conditions:
(1) when tape rewind was used and only "stressed" and "reduced" syllables were
marked (making those unmarked be "unstressed" by default; cf. Hughes, Li, and
Snow, 1972); (2) when the digitizing and computer replay method was used, and
each syllable was marked as either "highly stressed", "lesser stressed",
"unstressed", or "reduced"; and (3) when each syllable had to be marked as stressed,
unstressed, or reduced, and the tape rewind method was used. The three
conditions gave similar results for syllable-by-syllable judgments, except that when
each syllable was not necessarily marked, many syllables were apparently
"unstressed" by default (that is, they perhaps should have been marked reduced or
stressed, but they were "missed" in the process of marking only stressed and reduced
syllables). Most "highly stressed" and "lesser stressed" syllables from the
computer-replay run were judged as "stressed" in the three-level tape-replay run.
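
A comparison of this kind can be sketched as follows: the four-level labels are
first collapsed onto the three-level scheme, and the two runs are then compared
syllable by syllable. The mapping, the function names, and the example
judgments are assumptions for illustration and are not data from the
experiments.

```python
from collections import Counter

# Assumed mapping from the four-level labels to the three-level scheme.
FOUR_TO_THREE = {
    "highly stressed": "stressed",
    "lesser stressed": "stressed",
    "unstressed": "unstressed",
    "reduced": "reduced",
}


def collapse(labels):
    """Map four-level stress labels onto stressed/unstressed/reduced."""
    return [FOUR_TO_THREE[label] for label in labels]


def compare_runs(run_a, run_b):
    """Return (disagreement rate, counts of each confused pair of labels)."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must label the same syllables")
    confusions = Counter(
        tuple(sorted((a, b))) for a, b in zip(run_a, run_b) if a != b
    )
    return sum(confusions.values()) / len(run_a), confusions


# Hypothetical judgments for eight syllables.
computer_replay = ["highly stressed", "reduced", "lesser stressed", "unstressed",
                   "reduced", "highly stressed", "unstressed", "reduced"]
tape_replay = ["stressed", "reduced", "stressed", "reduced",
               "reduced", "stressed", "unstressed", "unstressed"]

rate, confusions = compare_runs(collapse(computer_replay), tape_replay)
print(rate, dict(confusions))
```

In this made-up example the runs disagree on two of eight syllables, and both
disagreements are between the unstressed and reduced categories.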

Results did differ somewhat from talker to talker, and were affected by the
individual listeners' judgments, as might be expected. Yet, there was
considerable agreement about the stress levels of many syllables, regardless of talker
or listener (cf. Hughes, Li, and Snow, 1972). For example, when listener WAL
marked stressed, unstressed, and reduced syllables for each talker, using the
tape rewind method, the total results of comparing the other five talkers to
ASH (as in Figure 9f, page 49) showed that only 16% of all judgments for ASH
differed from those for any of the other talkers. Over two thirds of these
"confusions" were between unstressed and reduced syllables.

Of the 127 syllables in the Rainbow Script, about 50 to 60% were judged by
listener WAL to be stressed (depending upon the talker), about 35 to 46% were
judged as reduced, and 10 to 25% were judged as unstressed. The strategy of
doing a partial distinctive features analysis in only stressed syllables thus
would lighten the load on acoustic analysis considerably, while still allowing
acoustic analysis on enough syllables that considerable distinctive features
data can be available for guiding lexical hypothesis making and aiding
higher-level structural analyses.
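
As a rough worked example of that saving, the sketch below converts the 50 to
60% stressed range into syllable counts for the 127-syllable script. The
function name is hypothetical, and rounding to whole syllables is an assumption
of the sketch.

```python
TOTAL_SYLLABLES = 127  # syllable count of the Rainbow Script, from the report


def analyzed_syllables(total, stressed_fraction):
    """Syllables that would receive distinctive features analysis."""
    return round(total * stressed_fraction)


low = analyzed_syllables(TOTAL_SYLLABLES, 0.50)
high = analyzed_syllables(TOTAL_SYLLABLES, 0.60)
skipped_min = TOTAL_SYLLABLES - high
skipped_max = TOTAL_SYLLABLES - low
print(low, high, skipped_min, skipped_max)
```

Under these assumptions roughly 64 to 76 syllables would be analyzed in detail,
and roughly 51 to 63 would be skipped.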

Further  perception  tests  are  to  be  made,  with  other  listeners  marking  all 
syllables,  and  with  repetition  tests  by  one  listener  under  identical  conditions. 

The Rainbow Script is recognized as not being ideal for isolating and
studying the individual effects of intonation contour, position in the sentence,
phrase structure, semantic structure, and phonetic content on stress and boundary
results. The design of texts to isolate such factors is being undertaken. In
addition, Univac will be evaluating whether speech data recorded by ARPA systems
contractors will demonstrate systems effectiveness under varieties of such
prosodic, syntactic, phonemic, and semantic conditions. Efforts will be undertaken
to integrate prosodic information into other programs on total speech
understanding systems.